FOREWORD


Readers will note a number of changes in style in the physical make-up
of this issue of the Review. For suggestions leading to several of these 
changes we are indebted to Frank W. Hubbard. The Editorial Board, 
after careful consideration, has decided to place the bibliography for each 
chapter at the end of the chapter rather than at the back of the issue, as 
heretofore. Although there are certain advantages in the old position, 
it is believed that readers will find the new placement a convenience. The 
list of issues already published has been arranged on the inside back cover 
according to topic, and the forthcoming issues for the next year are given. 


Douglas E. Scates
Chairman of the Editorial Board 











INTRODUCTION 


The term “psychological tests” as used in this Review includes tests of
general academic aptitude, tests of aptitude in special fields, and
instruments for the appraisal of personality and character. The scope of this
issue is similar to that of previous issues on this topic. The main
difference between the outline for the present number and that for June 1938
is that tests of infants and young children have been absorbed in the
other chapters, and a chapter on projective methods in personality study
has been added. The appearance of a chapter on projective methods is
the natural result of increased research activity in that area during the
last two or three years.

It should be understood that the bibliography does not list all the
investigations concerned with psychological tests that were published during
the period under consideration. Space limitations made careful selection
a necessity. American research was emphasized, although pertinent
investigations in other countries were not neglected.

The rather voluminous bibliography on tests of extrasensory perception
might logically have been included in this issue. However, both because
the space was limited and because the relationship of such tests to
education seems remote at present, a decision was made not to review them in
this number.

It would be highly desirable for a review of this kind to include not
only descriptive summaries but also critical evaluations of the different
investigations. When several hundred studies are covered within a space
of little more than a hundred pages, however, there is scant room for the
detailed treatment which adequate criticism requires. Consequently this
review, like most summaries of research, is mainly descriptive and factual
rather than critical and evaluative, although there is a certain amount of
evaluation with respect to the general procedures used.


Arthur E. Traxler, Chairman
Committee on Psychological Tests
CHAPTER I


Brief Overview of the Period*
ARTHUR E. TRAXLER


Recent Test Summaries and Bibliographies


The contributions of research to the construction and use of
psychological tests have become so numerous that it is doubtful if any one
individual, however industrious and efficient, could keep up with all of
them unless he devoted his full time to the task. It is not surprising that
under these circumstances summaries and bibliographies dealing with
measurement have appeared frequently in recent years.

During the period with which this review is concerned several valuable
reviews of research, pertaining wholly or in part to psychological tests,
were made available. The June 1938 number of the Review of Educational
Research, prepared by a committee under the chairmanship of
Symonds (8), reviewed the literature on psychological tests for the period
January 1935 to January 1938. Greene and others (4) prepared the
December 1938 issue covering educational tests from July 1935 to July 1938.
A brief summary of recent research on intelligence, aptitude, personality,
and achievement tests was written by Ruch and Segel (7) for the
December 1939 number of the Review. Watson (10) presented an overview
of the historical development and implications of intelligence, aptitude,
and personality testing in the Thirty-Seventh Yearbook of the National
Society for the Study of Education. Hildreth (5) revised and brought up
to date her well-known bibliography of mental tests and rating scales.
Buros (1, 2) performed a valuable service by publishing his Nineteen
Thirty-Eight and Nineteen Forty Mental Measurements Yearbooks, in
which hundreds of new tests were discussed by reviewers, many of whom
made reference to research data in their comments. Wang (9) prepared
an annotated bibliography of mental tests and rating scales containing
1,776 items. Some textbooks on measurement, such as Freeman's Mental
Tests (3), contain a large amount of pertinent information about research
relating to psychological tests. Hildreth’s book is the most comprehensive
single bibliography of tests; Buros’ yearbooks provide the most detailed
and thorough appraisal of new tests; Wang's volume is the most extensive
annotated list of tests of intelligence, special aptitude, and personality.
Other bibliographies are given in Chapter V.


Survey of Recent Trends in Psychological Measurement


Although there have been a few individual contributions that seem
to exemplify new approaches to the measurement of intelligence, aptitude,
or personality, no wholly new trends are discernible in the period under
review. Rather, there has been an extension and intensification of certain
trends begun earlier. The preparation of the Terman-Merrill revision of
the Binet test, the new revision of the Kuhlmann test, and the Bellevue
Intelligence Scales, as described in Chapter II, have given added impetus
to the use of individual intelligence tests of the verbal type. Improved
group tests for obtaining IQ’s have also been made available, but the
tendency in recent group tests has been in the direction of breaking down
the total score, mental age, or IQ into two or more aspects of mental
ability. Thus, we have such new tests as the California Test of Mental
Maturity, which purports to measure language and nonlanguage factors;
the American Council Psychological Examination, 1938, 1939, and 1940
editions, which yields scores for linguistic ability and quantitative ability
in addition to the total score; and Thurstone’s Primary Mental Abilities
Tests, which provide scores for seven “factors.” The guidance values of
none of these new scores have yet been established by research, although
various persons have begun an attack upon this important problem.

* Bibliography for this chapter begins on page 6.

Perhaps the most important, and certainly the most widely publicized,
recent application of intelligence tests has been in connection with the
revival of the nature-nurture controversy. The investigation and animated 
discussion of the influence of nursery school training and foster-home en- 
vironment on the IQ, as exemplified in the Thirty-Ninth Yearbook of the 
National Society for the Study of Education and in various independent 
articles and monographs, have provided much new data concerning the 
question of the relative influence of heredity and environment on intelli- 
gence. This recent work has not, however, provided any entirely new 
approaches to the question nor has it fundamentally changed the point 
of view that most psychologists have held for many years. Few serious 
students of mental development have ever insisted that the IQ is dependent 
wholly on either hereditary or environmental factors. The question has 
always been one concerning the relative weight to be attributed to hered- 
ity and environment. The renewal of the debate over the question has 
performed an important service in reminding test users that shifts in IQ 
are to be expected under certain conditions and that interpretations of 
the results of mental tests should be made in the light of the nature of the 
measuring instruments used and of the whole background and history of 
the individual. 

There has been considerable activity in the preparation, appraisal, and 
application of aptitude tests in a variety of fields. These tests are of great
interest to counselors and personnel workers who deal with potential
employees, and to employers who want to increase the efficiency of their
organizations.
As Segel has pointed out in Chapter IV, a test that is valid for workers 
on the job may not be valid for guidance into an occupation, and vice 
versa. 

Probably the most pronounced trends in personality measurement during
the period have been the increased application of factor analysis and
the development of projective methods. Factor analysis was mentioned
as a new technic of personality study in the June 1935 number of the 
Review and a few studies were listed; a somewhat larger number of in- 
vestigations employing factor analysis was reported in the June 1938 issue, 
but the total number was still small. Since that time there has been so 
marked an increase in such studies that the use of factor analysis may 
now be listed as one of the major technics of personality measurement. 

The fact that so many reputable leaders in psychometrics have recently 
turned their attention to the measurement of personality and character 
augurs well for future progress in this field. It seems clear, however, that 
there is a fundamental difficulty with measurement in this area which has 
not been removed, and probably cannot be removed simply by the per- 
fection of mathematical technic in the treatment of the data. Nearly all 
the leading personality tests continue to belong to the self-inventory type. 
The basic weakness of that type of test, that of eliciting truthful and de- 
pendable responses, requires no comment. It is probably this limitation 
in personality inventories as much as any other influence that has led to 
the growth of projective methods of appraisal, which are reviewed in 
Chapter VI. 

Notwithstanding continued activity in the production of tests of per- 
sonality and character, Rothney and Roens report in Chapter VII that 
there seems to be a decline in the application of tests in this area. This 
may indicate that potential users of personality tests have grown more 
critical and are insisting that the tests themselves first be evaluated care- 
fully before they are applied on a service basis. Few of the several hun- 
dred available tests in this field can be recommended for anything more 
than trial and experimentation at the present time. 

The marked increase in machine scoring since the advent of the Inter- 
national Test Scoring Machine in 1935 is exerting a gradual but inexorable 
influence on the form and the administration of tests in all fields. What 
was formerly the most frequently used semiobjective type of test item— 
the completion question—is being forced out of existence. There can no 
longer be any compromise with objectivity; all tests that are to be ma- 
chine scored must be completely objective. Suggestions for adapting tests 
to machine scoring have been given in an article by Koran (6) and in 
various publications of the International Business Machines Corporation.
One large noncommercial test publisher, the Cooperative Test Service, 
has recognized from the beginning the importance of machine scoring and 
has adapted nearly all its tests so that they could be scored by this pro- 
cedure as well as manually; certain other publishers are assuming a sim- 
ilar point of view. The recent addition of an item analysis unit to the 
scoring machine increases its significance for research work. 
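
The requirement of complete objectivity, and the kind of tally an item analysis produces, can be sketched in Python, a modern notation that is not part of the scoring machine itself. The sketch assumes each item has exactly one keyed response; the function names `score_sheet` and `item_difficulty` and the sample answer sheets are hypothetical.

```python
def score_sheet(key, responses):
    """Count responses that match the key; a blank (None) earns no credit."""
    return sum(1 for k, r in zip(key, responses) if r == k)


def item_difficulty(key, sheets):
    """Proportion of examinees answering each item correctly."""
    n = len(sheets)
    return [sum(1 for s in sheets if s[i] == k) / n for i, k in enumerate(key)]


# Hypothetical four-item key and four answer sheets.
key = ["B", "D", "A", "C"]
sheets = [
    ["B", "D", "A", "C"],
    ["B", "A", "A", None],
    ["C", "A", "A", "C"],
    ["B", "D", "B", "C"],
]

scores = [score_sheet(key, s) for s in sheets]
difficulty = item_difficulty(key, sheets)
print(scores)
print(difficulty)
```

Because every response is either the keyed one or not, no judgment by the scorer enters the result, which is the property a completion question cannot guarantee.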

On the whole, the newer psychological tests have been more carefully 
constructed than most of those prepared a few years ago. A test of what- 
ever kind, however, is worth the serious attention of school people only 
if it has practical meaning, that is, only if it yields scores that have values for
educational or vocational guidance or for the clinical analysis and treatment
of a weakness of some sort. The greatest single need for research on
psychological tests appears to be the discovery of the relationships be- 
tween scores and success in educational and vocational fields, or between 
scores and adjustment as determined by dependable criteria. This calls 
for a variety of longitudinal and follow-up studies. It also calls for
cooperation, for no one individual and probably no single organization can
do the job alone.
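
One common way of expressing the relationship between test scores and a later criterion is the product-moment correlation coefficient. The following Python sketch shows the computation; the function name `pearson_r` and the scores and grade averages are hypothetical illustrations and are not data from any study reviewed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical follow-up data: test scores and later grade averages.
test_scores = [95, 100, 105, 110, 120]
later_grades = [2.0, 2.4, 2.2, 3.0, 3.4]

r = pearson_r(test_scores, later_grades)
print(round(r, 3))
```

A coefficient of this kind is only as dependable as the criterion measure, which is why the text stresses follow-up studies with dependable criteria.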
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CHAPTER II 


Current Construction and Evaluation of Intelligence
Tests¹


DEWEY B. STUIT 


The construction of new mental tests, both group and individual, during
the past three years has been limited. Revisions of older tests have
been more numerous. In a sense it is hardly possible to construct a dis- 
tinctly new test at this stage in the history of mental measurement. The 
situation is somewhat analogous to the field of automotive engineering. 
The automobile is constantly being improved but to date has not been 
replaced with a radically different mode of transportation. Likewise, test 
construction now is largely a matter of refining present instruments by 
new and old statistical methods, discarding weak items or sections, im- 
proving ease of administration and interpretation, and providing better 
norms. The time may come when we shall see an entirely new technic in 
mental measurement but to date it has not appeared. 


Individual Intelligence Tests (Verbal) 
New and Revised Tests 


The measurement of adult intelligence has persisted as one of the major 
problems in psychological testing. To fill this need Wechsler (75) con- 
structed the Bellevue Intelligence Scales. The most distinctive feature of 
these tests is that the average performance of individuals in any age group 
is taken as the point of reference for that age group. The net result of this 
procedure is that older subjects generally receive higher IQ’s on the Belle- 
vue Scales than they do in other tests of general intelligence. In general, 
the test results have agreed remarkably well with clinical judgments. The 
ten subtests, five verbal and five nonverbal, are as follows: Information 
Test, General Comprehension, Combined Memory Span Test for Digits 
Forwards and Backwards, Similarities Test, Arithmetical Reasoning, Pic- 
ture Arrangement Test, Picture Completion Test, Block Design Test, Object 
Assembling Test, and Digit Symbol Test (alternate—a vocabulary test). 
Since a point scale is used the individual’s intelligence rating is obtained 
from a summation of credits awarded for passing various test items. The 
IQ can be computed from the individual’s performance in the full scale, 
verbal scale, or performance scale. Intelligence quotient tables in the 
appendix afford an easy means for converting the weighted scores into
intelligence quotients.
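
The principle of taking each age group's own average as the point of reference can be illustrated with a short Python sketch. It is not Wechsler's actual conversion, which rests on published tables; the center of 100, the spread of 15, the function name `age_referenced_score`, and the two small groups of raw scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def age_referenced_score(raw, age_group_raws, center=100.0, spread=15.0):
    """Express a raw score relative to the mean and spread of its own age group."""
    group_mean = mean(age_group_raws)
    group_sd = pstdev(age_group_raws)
    return center + spread * (raw - group_mean) / group_sd


# Hypothetical raw scores: the older group averages lower than the younger.
younger_group = [50, 55, 60, 65, 70]
older_group = [35, 40, 45, 50, 55]

same_raw = 55
younger_rating = age_referenced_score(same_raw, younger_group)
older_rating = age_referenced_score(same_raw, older_group)
print(round(younger_rating, 1), round(older_rating, 1))
```

The same raw score yields a higher rating when referred to the older group, which is the effect the text describes for older subjects.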


1 Bibliography for this chapter begins on page 21. 
The method of standardization employed is worthy of special note. For
the adult norms over 1,800 subjects, both male and female, ranging in 
age from seventeen to eighty and coming from all walks of life, were ex- 
amined and their records placed in a file. From this file, cases were taken 
to match the general occupational distributions of the census population. 
The number of adults in the standardizing population was approximately 
1,000. The children’s sample was taken so that it matched the age-grade 
distribution in the New York public schools. About 2 percent, the lower- 
grade defectives, were taken from patients up for commitment at Bellevue 
or cases already committed to some New York institutions for the feeble- 
minded. In the case of both the adults’ and children’s samples it was as- 
sumed that individuals from New York State were representative of the 
nation as a whole. 
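
The procedure of drawing cases from a file so that the sample matches a known occupational distribution is a form of quota sampling. The Python sketch below shows one way it might be done. The occupational categories, their shares, the case file, and the function name `quota_sample` are hypothetical and are not taken from the Bellevue standardization.

```python
import random
from collections import Counter


def quota_sample(cases, target_shares, total, seed=0):
    """Draw `total` cases so that each group's count follows `target_shares`."""
    rng = random.Random(seed)
    sample = []
    for group, share in target_shares.items():
        pool = [c for c in cases if c["occupation"] == group]
        quota = round(total * share)
        if quota > len(pool):
            raise ValueError(f"not enough {group} cases on file")
        sample.extend(rng.sample(pool, quota))
    return sample


# Hypothetical file of examined cases, larger than the sample wanted.
case_file = (
    [{"id": i, "occupation": "manual"} for i in range(8)]
    + [{"id": 100 + i, "occupation": "clerical"} for i in range(6)]
    + [{"id": 200 + i, "occupation": "professional"} for i in range(6)]
)
census_shares = {"manual": 0.5, "clerical": 0.3, "professional": 0.2}

sample = quota_sample(case_file, census_shares, total=10)
counts = Counter(c["occupation"] for c in sample)
print(counts)
```

Matching on occupation controls only that one characteristic; the sample may still differ from the population in other respects, such as region.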

The chief advantages claimed for the Bellevue Scales are: (a) the ma- 
terial is well suited for the testing of adults; (b) an individual’s per- 
formance is compared with the average of his own age group; (c) the 
full scale gives appropriate weight to performance and verbal tasks; (d) 
the test results agree better with clinical judgments than those obtained 
from other general intelligence tests; and (e) the test is not difficult to 
administer. 

The Tests of Mental Development, an individual examination constructed 
by Kuhlmann (40), constitutes a most important contribution to the field 
of intelligence testing. The tests making up the scale are drawn from a 
wide variety of sources including the author’s previous revisions of the 
Binet Scale and the Kuhlmann-Anderson group tests first published in 
1927. Two basic principles from the Binet Scale were retained in the new 
scale. First, mental development is measured in terms of median abilities 
of children of different ages, the results being expressed in terms of mental 
ages or mental growth units and, second, the increase in median raw score 
on a test between adjacent ages constitutes the chief criterion of its validity. 

Several distinctive features have been incorporated in the new scale. 
The use of the Heinis Mental Growth Curve in the selection and scoring 
of the tests is a unique feature. The result is that one test is administered 
or scored at every third point in this curve. A test’s position in the scale 
was determined by the requirement that 50 percent of the children had 
to pass it at that point. A total of four new scores is yielded by the scale: 
the percent of average (PA), a speed score, an accuracy score, and a 
variability score. It was suggested by the author that the latter is not fully 
satisfactory, but it does attempt the measurement of an important aspect 
of a subject’s performance. 
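
The 50 percent requirement for locating a test in the scale can be sketched as a simple interpolation. The Python example below uses chronological ages in years for convenience, whereas the scale itself is laid out in points on the Heinis curve; the function name `fifty_percent_point` and the pass rates are hypothetical.

```python
def fifty_percent_point(points, pass_rates, target=0.5):
    """Interpolate the scale point at which the pass rate reaches `target`."""
    pairs = list(zip(points, pass_rates))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if p0 <= target <= p1 and p1 > p0:
            return x0 + (target - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("pass rate never crosses the target")


# Hypothetical proportions of children passing one test at successive ages.
ages = [6, 7, 8, 9]
rates = [0.20, 0.40, 0.60, 0.85]

placement = fifty_percent_point(ages, rates)
print(placement)
```

With these figures the test would be placed about midway between the seven-year and eight-year points.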

The standardization population consisted of white children in the public 
schools of small and medium-sized towns of Minnesota, with the exception 
of four children at each age up to five years inclusive who were from 
St. Paul and Minneapolis. Names and addresses of children in the pre- 
school classification were taken from the birth registration records. In all 
cases birth dates were very close to examination dates. The author cited
data to show that Minnesota is a fairly typical state. Perhaps the most
serious limitation of the norms is the possible effect of elimination from 
school. 

The complete scale, designed for testing individuals from a chrono- 
logical age of four months to adulthood, consists of 89 tests arranged in 
order of difficulty. In administering the examination the examiner begins 
about fifty growth curve points below the estimated mental age. The tests 
are given in order from lowest to highest and are continued until it ap- 
pears reasonably certain that no more successes will be scored. Tests may 
be administered in a different sequence if such a procedure seems pref- 
erable. A convenient score card is provided for recording the individual’s 
responses in terms of rights, wrongs, and time, depending on the item. 
The score is easily translated into mental growth units which must then 
be converted to a mental age score, a percent of average score, or an IQ. 
The latter can be computed by means of tables provided in the appendix. 

This mental examination may appear formidable to a number of clinical 
psychologists. Careful study is necessary to achieve mastery of the scoring 
technics. After this has been accomplished the examination will provide 
the clinician with an array of facts which should prove helpful in the 
diagnosis of mental ability. However, further research is necessary before 
definite conclusions can be drawn regarding the clinical value of the data 
yielded by the scale. 

The Detroit Tests of Learning Aptitude (6) have recently been revised. 
There are still nineteen tests in the scale, but the name of one test has 
been changed. According to the author the chief value of the tests lies in 
their diagnostic results and the application of these to learning situations. 
The interested reader should consult the February 1938 issue of the 
Review of Educational Research concerning criticisms of the scale
in its original form. 


Evaluation of Stanford-Binet Tests 


The largest number of evaluative studies pertain to the old and new 
Stanford-Binet. Comparisons between the total scores and performance 
in the subtests of these two well-known measuring instruments have re- 
cently appeared in literature and have been concerned primarily with 
the problem of agreement between the scales and the direction of change 
in cases where changes were observed. Merrill (46), in a study of 1,517 
school children who had previously been examined with the old scale, 
reported that examination with the revised scale showed close agreement 
between the intelligence ratings. Inspection of the distribution of gains 
and losses indicated that IQ losses were found more frequently in cases 
testing below 100 while IQ gains were found more frequently in cases 
testing above 100. The larger standard deviation of the revised scale would 
lead one to expect such results. In this connection the findings of Rheingold
and Perce (56) are of interest. These investigators compared the scores
on the original and the Revised Stanford-Binet Scale, Form L, at the
high-grade mental defective levels. The scales were administered consec- 
utively with such changes that any difference in scores could be attributed 
to a difference in the content of the two scales. The results, in terms of 
IQ’s and mental ages, showed a high degree of correlation between the 
scales, but there was a tendency for the Form L intelligence quotient to 
be higher for subjects with IQ’s from 70 to 82. The latter finding is in 
contrast with that which would have been expected on the basis of Bern- 
reuter and Carr’s theoretical computations (11) and also fails to agree 
with Merrill’s results (46). 
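
The expectation that a larger standard deviation produces losses below 100 and gains above 100 follows from simple arithmetic, sketched here in Python. The sketch assumes that a subject keeps the same relative standing on both scales and that both scales have a mean of 100; the standard deviations of 13 and 16 and the function name `rescale_iq` are illustrative assumptions, not figures from the studies cited.

```python
def rescale_iq(iq_old, sd_old, sd_new, mean=100.0):
    """Carry the same relative standing onto a scale with a different spread."""
    z = (iq_old - mean) / sd_old
    return mean + z * sd_new


# Illustrative spreads only: 13 for the older scale, 16 for the revised one.
for iq in (74, 100, 126):
    print(iq, rescale_iq(iq, sd_old=13.0, sd_new=16.0))
```

Under these assumptions an IQ of 74 falls to 68 and an IQ of 126 rises to 132, while 100 is unchanged. Findings of higher Form L quotients in the 70 to 82 range therefore run against this purely statistical expectation.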

In the Wayne County Training School, Hoakley (34) found that in 
general the Terman-Merrill IQ was higher than that obtained by the use 
of the Stanford-Binet. However, analysis of the data revealed that between 
the ages of ten and thirteen inclusive, the Terman-Merrill IQ was lower 
than the Stanford-Binet. From fourteen years up, the Terman-Merrill IQ 
was found to be higher. Using the Heinis Personal Constant in place of 
the IQ the author found the two tests to be in remarkably close agree- 
ment. Munson and Saffir (51) retested one group of 1,000 children with 
the old Stanford-Binet and a second group with the revised scale, the old 
Stanford-Binet having been used in the original testing. They found a 
drop in IQ with either scale but the drop was smaller when the revised 
scale was used. The investigators concluded that the change in IQ is no 
greater with the revised scale than with the original one. Atwell (5) com- 
pared the ratings of 100 unselected subjects in the vocabulary test of the 
Stanford-Binet and the Terman-Merrill Revision. The average mental age 
in the Stanford-Binet was fourteen years, whereas in the Terman-Merrill 
Revision it was seventeen years and four months. However, the percent of 
words known on the two tests was the same and the correlation between 
the two tests was .86. Elwood (25) also found the vocabulary test in the 
Revised Stanford-Binet Scale, Form L, to be easier than the vocabulary 
test of the old Binet as indicated by the larger percents of success at the 
lower mental age levels. 


Interpreting Patterns and Variability in Scores 


Irregularities in patterns of successes on the 1916 Stanford-Binet and 
the 1937 revision constitute an interesting and important problem for 
clinical psychologists. Some have interpreted scatter as an indication of
psychotic trends while others have regarded it as a symptom of uneven- 
ness in test difficulty. Harriman (31) found in a study of 200 sixth-grade 
children, 175 of whom had a basal mental age of ten and 25 a basal age 
of eleven, that 52 percent achieved successes in one or more items at the 
fourteen-year level and 3 percent scored successes at the Superior Adult 
III level on the 1937 revision. With respect to specific items the results 
are even more striking. For example, at the average adult level, 40 percent 
of Harriman’s subjects succeeded in the codes, but only 10 percent explained
the proverbs. The average IQ of these subjects was 112. The vocabulary
test furnished the best single criterion for estimating how far
a pupil might achieve on the scale. It was suggested by Harriman that 
the clinical experience gathered with the use of the 1916 revision is not 
directly applicable to the 1937 revision. 

Using four different measures of scatter, Harris and Shakow (32) in- 
vestigated the variability of performance in the Stanford-Binet of 154 
schizophrenic, 133 normal, and 138 delinquent adults. Of the various fac- 
tors studied—psychotic condition, delinquency, chronological age, edu- 
cation, length of hospitalization, attitude, and mental age—only mental 
age proved to be related in any considerable degree to amount of scatter. 
When the groups were matched with respect to mental age the differences 
which originally existed tended to disappear. The Pressey measure of 
scatter was found to be superior to the other measures used in the study. 

Berger and Speevack (10) studied the range of scatter among 196 re- 
tarded children on Form L of the Revised Stanford-Binet. They found that 
when the test was extended beyond the first-year level in which all items 
were failed, the MA was increased on the average by two months, the 
range of increase extending from zero to fourteen months. Increases oc- 
curred in 42 percent of the cases, resulting in an average increase of 3.2 
months in MA for this group. The items frequently passed beyond the 
first zero point were drawing of designs from memory, making of change, 
picture absurdity, and the word memory and problems of fact at the 
thirteen-year level. 

The findings of Hildreth (33) confirm those of other investigators with 
respect to size of IQ and range of successes in the 1937 revision of the 
Stanford-Binet when compared with the 1916 revision. In this study it 
was found that for subjects ranging in age from six to seventeen there 
was a substantially larger increase in IQ when children originally given 
the 1916 edition were retested with the 1937 edition than when they were 
retested with the 1916 edition. There was an increase in IQ approaching 
statistical reliability for those taking the new and old forms of the test, 
and the range of success on the revised tests for subjects basing above 
the ten-year level was found to be substantially greater than on the 1916 
revision. The greatest increment in score occurred at the eleven-, twelve-, 
thirteen-, and fifteen-year age levels. With respect to this point, Hildreth’s 
findings are in disagreement with those of Hoakley. 


Clinical Evaluation 


The most lengthy criticism of the 1937 revision of the Stanford-Binet 
has been written by Kent (37). Her criticisms were written from the point 
of view of a practical clinician and center around the complexity and 
rigidity of the new scale. It is the judgment of Kent that language test 
material should be developed according to the method used by Pintner 
and Paterson for a group of performance tests. An item which can be
graded in difficulty should be developed into a graded series and standardized
as an independent unit. With a sufficient number of test units available,
the examiner could make up a battery custom-made to fit the individual
subject. The median performance in such a series of independent
ratings could then be taken as the most typical of the individual. Since 
items cannot readily be lifted out of the revised Stanford-Binet and be 
replaced with more suitable ones, Kent expressed herself as believing that 
the scale lacks the flexibility which should characterize a mental test suit- 
able for clinical practice. The latter criticism has also been made by 
Krugman (39). 

In a somewhat dissimilar vein Vernon (73) argued that the clinical 
psychologist has a tendency to rely too heavily upon his personal ex- 
perience and is, therefore, in need of the highly objective devices provided 
by the psychometrists. According to Vernon, the revised Stanford-Binet 
gives the clinical psychologist an excellent psychometric device which at 
the same time permits the necessary flexibility to control the subject situation
as it obtains in individual cases. That many clinical psychologists
share this point of view is attested to by the wide use of the new Binet
scale.


Other Studies and Evaluations 


The question of correlation between various mental tests is one of ex- 
treme practical importance as well as of theoretical interest. Arthur (3) 
studied the results obtained by administering the Kuhlmann-Binet and 
Stanford-Binet to two hundred subjects and found the median difference 
in IQ’s to be three points. Using the Heinis Personal Constant she found
a median difference of two points. She concluded that the agreement is 
as great or greater between the two scales as it is between test and retest 
scores using the same scale. Naturally there were some cases showing a 
large discrepancy between results in the two tests. 
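
A median difference between two scales given to the same subjects can be computed as in the following Python sketch. Whether the original study used absolute differences is an assumption made here; the function name `median_abs_difference` and the paired IQ's are hypothetical.

```python
from statistics import median


def median_abs_difference(first, second):
    """Median of the absolute differences between paired scores."""
    return median(abs(a - b) for a, b in zip(first, second))


# Hypothetical IQ's for five subjects on two scales.
scale_one = [88, 95, 102, 110, 97]
scale_two = [91, 93, 105, 104, 97]

print(median_abs_difference(scale_one, scale_two))
```

The median is little affected by the few cases showing a large discrepancy, which is why such cases can coexist with a small median difference.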

The present wave of interest in factor analysis has also invaded the 
field of individual tests. Wright (82) recently completed a factor analysis 
of the items in the old Stanford-Binet passed by between 10 and 90 per- 
cent of a group of 456 ten-year-old children. Upon rotation, there appeared 
a common factor for which two explanatory hypotheses were offered: first, 
that it represented Spearman’s “G” and, second, the one preferred by 
Wright, that it was an effect of maturation. The primary factors tentatively 
identified were Number, Space, Imagery, Verbal Relations, and Induction. 
A sixth factor apparently involved reasoning ability and a seventh could 
not be interpreted. 
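
The restriction of the analysis to items passed by between 10 and 90 percent of the group amounts to a simple filter, sketched below in Python. Treating both limits as inclusive is an assumption; the item names, pass rates, and function name `items_in_range` are hypothetical.

```python
def items_in_range(pass_rates, low=0.10, high=0.90):
    """Keep items whose pass rate lies between `low` and `high`, inclusive."""
    return [item for item, p in pass_rates.items() if low <= p <= high]


# Hypothetical proportions of the group passing each item.
pass_rates = {
    "item_a": 0.97,
    "item_b": 0.62,
    "item_c": 0.10,
    "item_d": 0.04,
    "item_e": 0.88,
}

kept = items_in_range(pass_rates)
print(kept)
```

Items passed by nearly everyone or by almost no one vary too little to contribute stable correlations, which is the usual reason for such a restriction.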

To date, but one evaluative study of the Wechsler Scale has appeared. 
Balinsky, Israel, and Wechsler (7) compared the relative effectiveness 
of the Stanford-Binet and the Bellevue Intelligence Scale in diagnosing 
mental deficiency. They found the Bellevue full scale to be distinctly superior
to the Stanford-Binet, assuming that psychiatrists’ judgments constitute
an adequate criterion for judging mental deficiency. The full scale
was found to give better results than either the verbal or performance 
scales alone. These investigators reached the judgment that performance 
tests should be included in attempts to differentiate between borderline 
intelligence and mental deficiency. 

The limitations of infant and preschool tests in the measurement of 
intelligence have been discussed in some detail by J. Anderson (1). The 
author cited evidence to show that the type of intelligence measured early 
in life is not the same thing that is measured by intelligence tests at later 
stages of maturity. Since development is a timed series of relations or se- 
quences, there are functions which are measured only in part at one stage 
but which can be measured more completely at later stages. The result is 
that the correlation between early and late scores in intelligence tests is 
likely to be low. In addition, if tests are administered in childhood, the 
errors of measurement are likely to be greater and hence less reliance 
should be placed on the results obtained by means of one test. These 
arguments of Anderson would seem to imply that intelligence test scores 
obtained early in life are not reliable indexes of later intellectual status. 
Such a conclusion is indeed far-reaching in its practical and theoretical 
implications. 


Group Intelligence Tests (Verbal) 
New and Revised Tests 


Among the recent new group tests is the Fifth Revision of the Kuhlmann- 
Anderson Tests. Several departures were made from the old arrangement 
of tests to eliminate from the lower ages or to move to higher ages tests 
dependent on school training. The range of mental ages possible was in- 
creased for the Grade II booklet and in the Grades VII-VIII booklet. Im- 
provements were also made in the instructions for administering the tests 
and in the method for securing median mental age or mental growth units. 
R. Anderson (2) cited evidence to show that the tests are not mere meas- 
ures of school achievement. 

The new Pintner General Ability Tests, Verbal Series (52) consist of
four batteries of mental tests, namely, the Pintner-Cunningham Primary
Test, Pintner-Durost Elementary Test, Pintner Intermediate Test, and
Pintner Advanced Test. The elementary test is still in preparation. The primary
test consists of seven subtests, and the intermediate and advanced tests each
consist of eight subtests. There are two forms of each battery available.
The individual’s score is computed in terms of median mental age in the 
several subtests. A nomograph is provided by means of which IQ’s can 
be computed. The nature of the test and its method of standardization 
make it appear to be one of our better group tests. 
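
The scoring just described may be illustrated by a minimal sketch. It assumes the conventional ratio definition, IQ = 100 x mental age / chronological age, which the nomograph presumably encodes; the function names and the sample subtest values are hypothetical and are not taken from the Pintner manual.

```python
from statistics import median


def median_mental_age(subtest_mental_ages):
    """Return the median of the mental ages (in months) earned on the subtests."""
    return median(subtest_mental_ages)


def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: 100 * MA / CA (assumed formula, not quoted from the manual)."""
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical mental ages (months) on the eight subtests of one battery.
subtests = [150, 138, 144, 156, 132, 147, 141, 153]
ma = median_mental_age(subtests)                  # median of the eight values
iq = ratio_iq(ma, chronological_age_months=132)   # pupil aged 11 years 0 months
```

Taking the median rather than the mean keeps a single aberrant subtest from shifting the pupil's mental age appreciably.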

One of the most interesting revisions of the Army Alpha Examination 
has been prepared by Guilford (29). This revision is based upon an 
analysis of scores made by a sampling of University of Nebraska students
in forms 5, 7, and 9 of the original Alpha and the Bregman revision. 
Comparison of the performance of high and low groups revealed that 
some items showed no difference between the groups and others showed 
negative differences. In the new revision the poor items were deleted and 
a better arrangement of items in order of difficulty was achieved. A factor 
analysis of the scores revealed the presence of three factors: V, the verbal 
factor; N, the number factor; R, a “relations” factor. The revision can
be scored for these factors, and norms based upon the performance of
University of Nebraska students have been provided. The difficulties
encountered in scoring a test for primary mental abilities were fully
discussed. Individuals who are interested in measuring these aspects of mental
ability should find this revision of the Army Alpha useful in their work. 
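
The comparison of high and low groups described above can be sketched as a simple item analysis. The review does not state which statistic Guilford used, so the difference in proportions passing, the zero cutoff, and the sample responses below are all assumptions made for illustration.

```python
def discrimination_index(high_group, low_group):
    """Proportion passing in the high group minus proportion passing in the low group.

    Each argument is a list of 0/1 responses to a single item.
    """
    p_high = sum(high_group) / len(high_group)
    p_low = sum(low_group) / len(low_group)
    return p_high - p_low


def select_items(responses_high, responses_low, minimum=0.0):
    """Keep items whose discrimination index exceeds `minimum` (hypothetical cutoff)."""
    kept = []
    for item in responses_high:
        d = discrimination_index(responses_high[item], responses_low[item])
        if d > minimum:
            kept.append(item)
    return kept


# Hypothetical responses (1 = pass, 0 = fail) for three items.
high = {"item_a": [1, 1, 1, 0], "item_b": [1, 0, 1, 0], "item_c": [0, 0, 1, 0]}
low = {"item_a": [0, 1, 0, 0], "item_b": [1, 0, 1, 0], "item_c": [1, 1, 0, 1]}
kept = select_items(high, low)
```

In this sketch item_b shows no difference between the groups and item_c shows a negative difference, so only item_a is retained.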

The Personnel Test developed by Hovland and Wonderlic (35) is a 
revision of the Otis Self-Administering Test of Mental Ability, Higher 
Form. By analysis of the items in the old test the authors were able to 
reduce the length of the test by one-third, making it possible to administer 
it in twelve minutes. Correlations between the Personnel Test and the 
Otis ranged from .81 to .87. Since the test was designed primarily for use 
in business personnel work, the authors did not provide tables for con- 
verting raw scores to mental ages. It is possible, however, to translate 
scores from the Personnel Test into comparable scores on the Otis. Norms 
for representative industrial and business groups are provided. Three 
forms are available. 

Other new tests or revisions which have recently appeared are the Cali- 
fornia Short-Form Test of Mental Maturity (66) and the Detroit Begin- 
ning First-Grade Intelligence Test, Revised (26). Annual forms of the 
American Council Psychological Examinations for High-School and for 
College Students have continued to appear. 


Evaluations of Group Tests 


Careful studies of validity and reliability coefficients and norms pre- 
sented by test authors are all too rare. An example of the type of study 
which should be made of every mental test is Traxler’s investigation (71) 
of the reliability, validity, and practical utility of the California Tests of 
Mental Maturity. He found the reliability coefficients for his group of 74 
ninth-grade pupils to be slightly lower than those reported by the pub- 
lishers. The language and nonlanguage factors were found not to be highly 
correlated in a group of 2. eighth-grade and 73 ninth-grade pupils. How- 
ever, the total score in the test did correlate highly with the Kuhlmann- 
Anderson tests of mental ability and with the American Council Psycho- 
logical Examination. The differences between language and nonlanguage 
IQ were found to be much greater for superior than for inferior readers. 
Possible use of the tests for measuring the intellectual capacity of poor 
readers is suggested by the findings. 

The results obtained when tests are readministered after long intervals 
of time are frequently discouraging. However, Davidson (23) reported a 
high correlation between the scores made by subjects in the old form of 
Bureau Test VI and the most recent revision of the test, the latter having 
been given approximately ten years later. Despite the fact that the material 
in the two tests is not closely comparable, the correlation between the 
test scores for fifty subjects was found to be .89. The test results were 
also found to correlate highly with level of clerical work achieved by the 
subjects included in the study. Results such as these are favorable to the 
use of intelligence tests in industry. 

Those who make use of test results often feel the urge to break down 
a total score into its several parts. Because the necessary research data 
frequently are lacking, meaning cannot be attached to many subtest scores. 
In a study of the American Council Psychological Examination, Super 
(67) investigated the differential prediction value of the Q and L scores
by correlating them with performance in the Nelson-Denny Reading Test,
Cooperative Survey Test in Mathematics, and the two parts of the Minnesota
Vocational Test for Clerical Workers. The correlation between L
scores and reading ability was .80 while a coefficient of .37 was found
between Q and reading scores. All other differences in correlation for the
Q and L scores were small. The Q and L scores predicted success in
mathematics with equal efficiency. From a guidance standpoint, it would seem
that one would not be justified in using the quantitative score as a differen- 
tial predictive index of achievement in mathematical subjects. 

The question frequently arises whether tests administered in high school 
give the same results as those administered in the fall to entering college
freshmen. Thomson’s study (68) throws some light on this question. A
group of 106 college freshmen took the 1935 form of the American Council
on Education Psychological Examination as high-school seniors in January
1937 and in September as entering college freshmen took the 1937
form of the test. The correlation between the gross scores was found to be
.87. When the scores on the 1937 form were converted into equivalent
scores on the 1935 form it was found that 49 percent of the scores changed
twenty or more points while approximately 5 percent changed forty or
more points. Thomson concluded that when dealing with large groups for
the purpose of predicting college success it makes little difference whether 
the ACE test is administered during the last year of high school or upon 
entrance to college. 


New and Revised Tests 


One of the most recent performance tests to appear is the Carl Hollow
Square Scale (18). This test was designed for use with adults and older
children. The materials consist of a wooden panel in which is cut a 414
inch square hole, and 29 blocks of varying straight line geometric forms,
each having both straight and bevelled edges. The problem is to fill the 
hole with blocks which are presented to the subject in groups of three 
or four. A total of twenty exercises are used in the test. The observant 
subject will note that certain combinations are repeated and consequently 
will be able to do some of the more difficult exercises with comparative 
ease. Each of the exercises is timed and the number of moves is counted. 
The score, based on the subject’s speed and accuracy, is translated di- 
rectly to an IQ which can then be converted into a mental age score. The 
test correlates well with other measures of mental ability, both of the 
verbal and nonverbal type. While the author admitted that the test meas- 
ures mental abilities which lean to the practical or concrete, he insisted 
that it is not a test of highly specialized abilities. It would appear, how- 
ever, that the test possesses possibilities as a mechanical aptitude test. The 
administration and scoring procedures appear a bit complicated and 
should be mastered thoroughly before the test is used seriously. 

Grove (28) introduced certain structural modifications in the Industrial 
Model of the Kent-Shakow Formboard Series, thereby simplifying the 
tests and facilitating the administrative procedures. The number of sub- 
tests was reduced from five to four and the scoring system modified to give 
proper weight to time and error factors. Data were presented describing 
the correlation of the series with other performance tests. The author be- 
lieves that the revised series may be said to measure ability to solve prob- 
lems presented in the form of concrete spatial relations. That the test 
does not measure the same type of mental functions as those measured by 
the Stanford-Binet is shown by the correlation of .43 between the two tests. 

Shakow and Pazeian (58) presented adult norms for the Clinical Model 
of the Kent-Shakow Formboard Series, thus increasing the usefulness of 
this mental examination. Scoring is in terms of decile rank derived from 
time scores. 

The most recent revision of the Ferguson Formboards was prepared by 
Wood and Kumin (80). Instead of scoring the tests exclusively on the 
basis of time required to complete a board, these investigators took into 
consideration the blocks correctly placed at the end of the allotted time, 
which is five minutes for each of the six boards. In addition, they divided 
the five-minute period into fifteen 20-second intervals and worked out a 
score value for each. Results obtained by the authors indicated that the 
correlation between Stanford-Binet mental age and Ferguson score is dis- 
tinctly lower for retarded children than for those in the middle and upper 
ranges of mental ability. They concluded that in the lower group, intelli- 
gence is less of a factor in performance than a special aptitude. 


Studies and Discussion of Performance Tests 


The controversy over the relation between “verbal intelligence” and 
“practical intelligence” is still not settled. Slater (60) made a factor 
analysis study of several tests purporting to measure spatial judgment 
and verbal ability. The existence of a mechanical ability distinguishable 
from both general intelligence and spatial judgment was not confirmed. It 
was found, however, that spatial judgment or “practical ability” is an in- 
dependent psychological function. A new Form Relations Test, differen- 
tiating between recognition and imaginative manipulation, was found to 
be a measure of the spatial relations factor. Some evidence was presented 
to show the value of the tests in selecting trade and engineering appren- 
tices. 

The use of the Porteus Maze Tests in clinical practice was discussed in 
two recent studies. Brill (15), in a study of fifty socially well-adjusted 
and fifty seriously maladjusted mentally deficient boys, found that the 
maladjusted boys scored higher on the average than the well adjusted 
and concluded that earlier statements as to the validity of the test in 
measuring social adaptation were not fully justified. In reply, Porteus (55) 
pointed out that Brill’s interpretation of feeble-mindedness was not in 
accordance with accepted standards and cited further evidence to indicate 
the validity of the test as a measure of the subject’s prudence and plan- 
ning capacity. In addition, Porteus emphasized that the maze test is a 
supplement to and not a substitute for the Binet. 

The effect of practice upon achievement in performance tests is of con- 
siderable interest. Mitrano (49) made a study of the readministration of 
the Witmer formboard by investigating the performance of fifty-seven 
feeble-minded subjects who had taken the Witmer and Stanford-Binet tests 
on four different occasions. The successive scores in the Witmer test were 
progressively higher and increased in magnitude twice as rapidly as did 
the corresponding Stanford-Binet scores. Subjects scoring highest in the 
Stanford-Binet showed a slight tendency to profit most from their pre- 
vious experience in taking the Witmer test. 

The relationship between performance in verbal and nonverbal tests of 
intelligence for bright and dull children was studied by MacMurray (42). 
He compared the intelligence of gifted children and of dull-normal chil- 
dren as measured by the Pintner-Paterson and Stanford-Binet scales. The 
dull group showed a mean increment of 9.4 points of IQ on the Pintner- 
Paterson as against the Stanford-Binet, but there was considerable over- 
lapping in IQ’s obtained from the Pintner-Paterson. For the dull group 
the correlation between IQ’s obtained in the two tests was .43 while for 
the superior group the correlation was .23. As pointed out by MacMurray, 
results such as these should give us pause when the tests are used inter- 
changeably. 

The findings in the factor analysis study by Morris (50) throw con- 
siderable light on the low correlations frequently found when the scores 
on verbal and performance tests are correlated. The analysis was made 
of thirty-four well-known performance tests which were administered to 
fifty-six nine-year-old boys in a New York City grade school. The inter- 
relationship among the tests ranged from moderately high positive to a 
moderately high negative relationship. The presence of one general factor 
was not disclosed but three group factors or abilities were rather clearly 
indicated. These were identified by Morris as being similar to the Spatial, 
Perceptual Speed, and Induction Factors described by Thurstone. Should 
these findings be confirmed by other investigators it could then be demon- 
strated that verbal and nonverbal tests do not measure the same compo- 
nents of what is frequently called general intelligence. For clinical prac- 
tice the implication would be that the tests should be used to supplement 
one another rather than interchangeably. 


Measures of Particular Mental Abilities 


The nature of mental abilities was brought more sharply to the attention 
of the practical worker by the publication of Primary Mental Abilities by 
Thurstone (70). The battery consists of sixteen tests arranged to be ad- 
ministered in three sessions requiring two hours and thirty-three minutes 
of testing time. In addition, a total of one hour and eight minutes is re- 
quired for the practice exercises which precede each test. The battery 
yields seven scores which represent relatively independent or uncorre- 
lated mental abilities or factors: Perception, Number, Verbal, Space, Mem- 
ory, Induction, and Reasoning. Each of the factor scores is obtained by 
the simple addition of the raw scores of two or three tests, the score being 
the number right—except that in four instances it is the number right 
multiplied by two. A profile of the individual’s standard score in each of 
the seven primary abilities can readily be prepared. 
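
The scoring rule just described, simple addition of number-right scores with certain tests counted twice, can be sketched as follows. The test names, the scores, and the choice of which test is doubled are hypothetical; the actual assignment of tests to factors is given in the published battery and is not reproduced here.

```python
def factor_score(raw_scores, doubled=()):
    """Sum the number-right scores of the tests assigned to one factor.

    `raw_scores` maps test name -> number right; tests named in `doubled`
    count twice, mirroring the "multiplied by two" exception in the text.
    """
    return sum(score * (2 if test in doubled else 1)
               for test, score in raw_scores.items())


# Hypothetical tests and scores for two of the seven factors.
number_tests = {"number_test_1": 31, "number_test_2": 24}
space_tests = {"space_test_1": 18, "space_test_2": 22, "space_test_3": 15}

number_score = factor_score(number_tests)                          # 31 + 24
space_score = factor_score(space_tests, doubled={"space_test_3"})  # 18 + 22 + 2 * 15
```

The resulting raw factor scores would still have to be converted to standard scores before a profile is drawn.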

The number of research studies reporting the use of the new battery 
is as yet small. Stalnaker (62) made an extensive analysis of tests taken 
by the freshmen entering the University of Chicago in 1938. Most inter- 
esting is his table of intercorrelations between the scores in the sixteen 
tests. Where an individual’s factor score was computed by adding the 
scores in two tests he sometimes found a very low correlation between 
the tests. In addition he found fairly high correlations between several of 
the factor scores indicating that for his population the factors were not 
strictly independent or uncorrelated. This investigator also expressed the 
belief that speed figures prominently in the scores though it is not listed 
as a primary factor. Some of the items were found to correlate poorly 
with the factor they were supposed to measure and for the most part the 
items were found not to be arranged in order of difficulty. In general, 
these criticisms are in agreement with those of Crawford (20). 

Thurstone (69) pointed out that when scores in several tests are com- 
bined to give a composite measure of a primary factor, the separate tests 
should have low correlations. If the several tests within a composite had 
high intercorrelations, they might have in common not only the primary 
factor but also other factors not intended to be measured by the com- 
posite. Thurstone also presented facts to show that the scores as now 
obtained for various primary factors must of necessity be correlated. Since 
each primary factor score is determined as a linear combination of several 
test scores and this composite score is only an estimate of the primary 
factor which has a sizeable saturation in each test of the composite, it 
follows quite naturally that such raw scores will be found to correlate 
positively. 
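
Thurstone's argument can be illustrated numerically. The sketch below is not his derivation; it assumes two strictly uncorrelated common factors, standardized tests, and hypothetical loadings in which each test has a sizeable saturation in its own factor and a small one in the other. Under those assumptions the correlation between two tests is the sum of the products of their loadings, and the correlation between the two composite scores follows from the usual formula for sums.

```python
from math import sqrt


def pair_correlation(loadings_i, loadings_j):
    """Correlation of two standardized tests under uncorrelated common factors."""
    return sum(a * b for a, b in zip(loadings_i, loadings_j))


def composite_correlation(group_1, group_2):
    """Correlation between two unweighted sums of standardized tests.

    Each group is a list of loading vectors, one per test; every test has
    unit variance, the remainder beyond its communality being unique variance.
    """
    def covariance(tests_a, tests_b, same):
        total = 0.0
        for i, a in enumerate(tests_a):
            for j, b in enumerate(tests_b):
                # A test's covariance with itself is its full unit variance.
                total += 1.0 if (same and i == j) else pair_correlation(a, b)
        return total

    cov = covariance(group_1, group_2, same=False)
    var_1 = covariance(group_1, group_1, same=True)
    var_2 = covariance(group_2, group_2, same=True)
    return cov / sqrt(var_1 * var_2)


# Hypothetical loadings on two uncorrelated factors (first, second).
first_factor_tests = [(0.7, 0.2), (0.6, 0.3)]
second_factor_tests = [(0.2, 0.7), (0.3, 0.6)]
r = composite_correlation(first_factor_tests, second_factor_tests)
```

With these illustrative loadings the two composites correlate about .44 although the underlying factors are uncorrelated by construction, which is the sense in which the raw factor scores "must of necessity be correlated."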

Despite the fact that there are defects in the present battery of Primary 
Mental Abilities Tests, the present reviewer is of the opinion that they 
represent a milestone in the measurement of human abilities. One of the 
foundation stones of guidance and personnel work is that individuals 
differ in their capacity to perform various types of tasks. Efforts to measure 
these abilities have lacked the precision required for practical utility. The 
theoretical foundations and the research underlying the construction of 
the Tests of Primary Mental Abilities open up new possibilities for indi- 
vidual diagnosis and guidance. The findings to date are not final but are 
promising and challenging. 

Tests designed to fill rather specific clinical needs are gradually
becoming more numerous. Shipley (59) developed a scale for the measurement
of mental deterioration based upon the clinico-experimental observation
that in such deterioration vocabulary level tends to be affected
but slightly while ability to see abstract relations declines rapidly. The 
impairment index employed in the scale is a vocabulary-abstract thinking 
test of twenty completion items. Each test has a ten-minute time limit. 
Standardization for mental age was based on a group of 1,046 normals. 
Extensive evaluative data are not yet available. 

Cattell (19) recently described a culture-free intelligence test which he 
believed suitable for measuring the intelligence of widely different racial 
and cultural groups. The seven subtests are all perceptual in character, 
but Cattell pointed out that they have consistently manifested high “G” 
saturations in the work of earlier investigators. The directions can be given 
in pantomime if necessary, thus eliminating or reducing the effect of a 
subject’s knowledge of language upon his score in the test. Many persons 
upon examination of the tests will argue that all aspects of an individual’s 
intelligence will not be sampled by the test exercises. Cattell believes that 
most of the higher-order mental processes can be measured by means of 
perceptual tests. Further research should decide the relative merits of 
these two points of view. 


CHAPTER III 


Applications of Intelligence Tests¹ 
J. B. STROUD 


Two kinds of applications of intelligence tests are discussed in this 
paper: (a) those of a research character in which tests are used in studying 
the various conditions and correlates of intelligence, and (b) those of a 
more practical nature in which tests are used in the interest of selection 
and prediction. The more detailed studies will be presented last. 


Racial Factors in Intelligence 


Much of the confusion of tongues in respect to racial differences in 
intelligence undoubtedly stems from differences in meaning of the terms 
intelligence and race. In an ethnological sense a racial characteristic is 
one that is transmissible by descent and is thus independent of the cultural 
environment in which individuals are reared. It may very well happen, and 
does, that some characteristics of races are not racial, at least when racial 
is used in an ethnological sense. Thus the assertion that one characteristic 
of the American Negro is low test intelligence need not imply that this 
characteristic of a race is truly a racial characteristic. It may be a cultural 
one; which would mean, if true, that when put on a par in opportunity with 
white Americans the Negroes would also be on a par with them in test in- 
telligence. Investigations show favorable comparisons between white and 
Negro children when they are thus matched economically and educa- 
tionally—which may be due to selective factors. The fact remains that 
Negroes as a whole are by no means on a par with white children in these 
respects and that as a whole they are not on a par with them in test 
intelligence. 

Performance on intelligence tests is not the only respect in which Negro 
children display less intelligence than white children. For example, they 
learn less readily, a fact that is itself predictable from their inferior test 
intelligence. These are simple statements of fact. It would be a mistake, 
however, to conclude that these differences are racial simply because they 
are characteristics of a race. 

Jenkins (43) maintained that inasmuch as white and Negro groups under 
comparison have not generally been equated as to environmental and cul- 
tural background, the comparisons are invalid. Invalid, with respect to 
what? Possibly, though not certainly, invalid to show the presence of racial 
differences, but not invalid to show differences between races. Attention is 
called also to a recent article by Hollingworth and Witty (37). This article 
and the one by Jenkins (44) reported the results of a rather extensive
search for superior intelligence among Negro pupils. Evidence accumulated
for a score of years tended to show that from 20 to 25 percent of the Negro
population equal or exceed the median test intelligence of the white
population (24). In a sampling of 8,000 Negro children they found 1.23 percent
to earn Stanford-Binet IQ’s of 120 or above. According to the best available
data, approximately 8 percent of the white children equal or exceed this
figure. By finding the best value of the mean and SD of the IQ of Negro
children from the published literature, one could predict quite reliably the
percent that would have IQ’s at or above 120 or any other IQ value.

¹Bibliography for this chapter begins on page 36.

In a survey of the DuSable High School (Chicago) Beckham (4) found 
by the Henmon-Nelson tests that according to Terman’s classification 1.5 
percent were mentally defective; 10.2 percent, borderline; 37.0 percent, 
dull; 44.2 percent, average; 5.4 percent, above average. Two percent 
earned IQ’s above 120. Hu (40) found a slight superiority of 116 Anglo- 
Chinese children of Liverpool and London over English children of com- 
parable schooling and socio-economic status. 


Family Factors 


Size of family—Inasmuch as a positive correlation obtains between 
occupational rating and test intelligence, and a negative correlation between 
occupational rating and birth-rate, we may infer that test intelligence is 
negatively correlated with birth-rate. According to Penrose’s data (77), 
there are 4 children, on the average, born to families of unskilled occupa- 
tional classification and 2 to those of professional classification. Roberts 
(89) tested all of the children between nine and thirteen years of age in 
the city of Bath. A correlation of —.23 was obtained between IQ and the 
number of children in the family. He reported that the infertility of the 
gifted poor equalled that of the gifted rich. Cattell (15) estimated, from 
sociological data, that there has existed an inverse relationship between 
social status and fertility since at least 1870 in Britain and 1890 in America. 
Moreover, “When an allowance for environmental handicap in test
performance has been made, rural groups, both in America and Britain,
average about 5 points of IQ lower than city groups. Their fertility is greater.”
Various investigations in Britain and America have yielded coefficients 
of the following magnitude between test intelligence and fertility: —.33, 
—.27, —.30, —.25, —.19, and —.34. 

Moshinsky (71) studied the relationship between fertility and test in- 
telligence in offspring in the case of about 10,000 children with fairly 
homogeneous socio-economic groupings. While he found a low negative 
correlation between fertility and test performance, the magnitude of the 
coefficients is greatest for the middle class. In fact they lack statistical sig- 
nificance for the samplings drawn from the poorest and most prosperous 
sections. Willoughby (113) found no relationship between the test intelli- 
gence of college women and the number of living children, for a class 
having graduated in 1927. 


February 1941 APPLICATIONS OF INTELLIGENCE TESTS 

Parent-child relationships—With tests which, in comparison with the 
prevailing ones, are relatively culture-free, Cattell (14) found a correlation 
of .91 between test intelligence of mid-parent and child. The reader will 
recall that r = .50 is customarily taken as the amount of relationship to 
be expected of parent-child performance. Willoughby (112) obtained a co- 
efficient of .51 between father and son-daughter scores and of .55 between 
mother and son-daughter scores. In a recent article Conrad and Jones (20) 
reported a correlation of .49 between parent and offspring Stanford-Binet 
scores, and of .49 between parent and offspring Army Alpha scores, 997 
individuals, 269 families being represented. An average correlation of .49 
is also reported between sibling pairs. 

Age of parent—The relation between parental age at time of birth and 
the test intelligence of the offspring was treated in a recent article by Punke 
(83) with respect to college women. It is found, for example, that of the 
entering freshmen who earned less than an average rating on the
psychological examination, 58 percent had fathers no more than 30 years older
than themselves, 67 percent had mothers no more than 25 years older than 
themselves. Of those in the upper half on the psychological examination, 
46 percent had fathers no more than 30 years older than themselves; and 
50 percent, mothers no more than 25 years older. In performance on 
English, mathematics, social studies, and science, the differences are gen- 
erally greater than they are on the psychological examination. It should be 
said, however, that data such as these do not prove that it is more favorable 
to be born of fathers over thirty and mothers over twenty-five than of 
younger parents if reference is made to the same parents. Certainly one 
could not justify from these data the recommendation that parenthood 
should be delayed until these ages have been passed. The number of un- 
controlled factors could be reduced by comparing children within the same 
family, as in the studies of order of birth. 


Birth Factors 


Prematurity of birth—For seventy-eight children who were born pre- 
maturely, Benton (5) found no evidence of adverse effects on test intelli- 
gence at preschool ages. No relationship was indicated between weight at 
birth and test performance at preschool age. 

Seasonal variations—The reader will recall that certain investigators, 
MacMeeken (64) and Pintner and Forlano (78), have obtained evidence 
of a very slight association between IQ and the month of birth. Children 
born in a warm month have 1 or 2 IQ points’ advantage over those born in 
a cold month, February being the least favorable and September the most 
favorable. Fialkin and Beckman (25) observed the same phenomenon in 
a sampling of about 3,000 adults. Pintner and Forlano (79) have com- 
piled data gathered from various sources in the southern hemisphere. The 
results tend to show a very slight advantage, about 1 IQ point in favor of 
those born in warm months. 


Socio-Economic Status 


Upon the basis of an extensive survey of the literature pertaining to 
socio-economic status, Loevinger (59) concluded that the correlation be- 
tween the test intelligence of children (ages three to eighteen years) and 
the occupational ranking of their fathers may be represented by the value 
r = .4. She found no evidence of a positive correlation between these two 
variables in the case of children under eighteen months of age. Honzik (38) 
obtained negligible coefficients between test scores of children and
socio-economic index and education of parents in the case of children between 21
and 36 months of age. At three and one-half years of age a mean coefficient 
of about .25 was obtained. The magnitude increased somewhat for suc- 
cessive yearly examinations up to age seven, at which age the mean co- 
efficient stood at about .40. This phenomenon may, of course, be due in part 
to the unsatisfactoriness of the tests of intelligence at the lower age levels. 
Loevinger (59) estimated from her review of the literature that the average 
intelligence test achievement of the children who came from the highest 
occupational group was one standard deviation above the mean, and that of 
those from the lowest occupational group was one-half standard deviation 
below the mean. These correspond to IQ values of 116 and 92, respectively, 
taking 16 as the SD of the IQ. In an analysis of the intelligence test scores 
of University of New Mexico students from 1921 to 1936, Haught (31) found the 
mean scores to rank according to fathers’ occupations, in descending order, 
as follows: professional, business and clerical, skilled, semi-skilled, and 
unskilled. 
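
The conversion of Loevinger's standard-deviation estimates into IQ values can be checked with a short sketch. The SD of 16 is taken from the text; the mean of 100 is implied rather than stated and is treated here as an assumption, and the function name is hypothetical.

```python
def iq_from_sd_units(z, mean=100.0, sd=16.0):
    """Convert a deviation expressed in SD units to the IQ scale.

    Defaults: SD of 16 as stated in the text; mean of 100 is assumed.
    """
    return mean + z * sd


highest_group = iq_from_sd_units(1.0)    # one SD above the mean
lowest_group = iq_from_sd_units(-0.5)    # one-half SD below the mean
```

The two results, 116 and 92, agree with the values quoted above.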

A comprehensive summary of the investigations on the test intelligence 
in backward communities has been published recently by Mann (68). 
Similarly, Neff (72) supplied a summary of the investigations of the rela- 
tionship between socio-economic status and test intelligence. A recent 
article by Bayley and Jones (3) also furnished pertinent data. 

Ball (2) correlated Pressey Mental Survey IQ’s derived in 1918 and 
again in 1923, when the subjects were distributed from Grades II to X, 
with occupational status attained in 1937. A correlation of .71 was found 
between the 1918 Pressey scores and 1937 ratings of occupational status 
by the Barr scale. Between the 1923 Pressey scores and the 1937 ratings 
the correlation was .57. Clark and Gist (19) obtained about the same rela- 
tionship between test intelligence and occupational choice of high-school 
pupils as that which obtains between the test intelligence of pupils and 
occupations of their fathers. 


Personality, Behavior, and Intelligence 


In a review of the literature Lorge (60) compiled a list of over two 
hundred correlation coefficients. One-half of these range in magnitude from 
.00 to .15; and one-fourth are above .30. Wile and Davis (110) found 
that competency of social adjustment is inversely related to the disparity 
between chronological age and basal age. They also compared the behavior 
of one hundred children having IQ’s from 120 to 148 with the behavior of 
an equal number of children having IQ’s from 50 to 79. Infantilism and 
regressive emotional and social behavior were more frequent among the 
superior; disturbing behavior at home, maladjustment at school, and inter- 
sibling conflicts, among the inferior (109). 

Delinquency—A. W. Brown and Hartman (10) reported the results of 
routine testing of adult male prisoners in Illinois for the years 1930-1936 
(N = 13,454). Bregman Alpha and some Stanford-Binet and Arthur Per- 
formance scores were obtained. As in previous surveys, the test intelligence 
of the prisoners distributed about as that of the army draft, with a dispro- 
portionate number of low scores. Williams (111) surveyed the literature on 
juvenile delinquency and found average IQ’s ranging from 79 to 90 re- 
ported for different populations. Since juvenile delinquency is negatively 
correlated with socio-economic status, something of this general tendency 
should be anticipated. The test performance of delinquents is roughly of 
the same order as that of nondelinquents of the same socio-economic 
status. 


Constancy of Mental Test Performance 


Recent investigations have shown that the constancy of mental test per- 
formance increases with age during the preschool period. This work was 
summarized in a recent article by Honzik (39), whose own investiga- 
tions are confirmatory. Predictions made from test performance upon 
single tests are found to be especially unreliable when the tests are ad- 
ministered before the age of two years. The data of Nelson and Richards 
(73,74) are capable of the same interpretation. In the case of 72 children 
they obtained correlations of .37 between performance on the Gesell items 
administered at age six months and the Merrill-Palmer Test administered 
at age twenty-four months, and of .46 between the Gesell items (age six 
months) and the Stanford-Binet Test administered at age thirty-six months. 
In a later article (80 cases) a coefficient of .56 was reported between the 
Gesell test administered at age twelve months and the same test read- 
ministered at age eighteen months. The Gesell test administered at age 
twelve months correlated with the Merrill-Palmer at twenty-four months to 
the extent of .32 and at thirty months, .35; and with the Stanford-Binet at 
36 months, .33. 

The reader will recall that coefficients in the neighborhood of .90 are 
readily obtainable between IQ’s derived several years apart in the case 
of children of school age. It is not known, at least to the writer, whether 
the relative inconstancy of test performance at the younger age levels is 
due to the inconstancy of mental development at these levels or to the 
inadequacies of the mental tests. 


Effect of Environmental Factors 


The proponents of nature and nurture have squared off anew under the 
impetus of the Thirty-Ninth Yearbook of the National Society for the 
Study of Education which, aside from the nature-nurture issue, is a com- 
petent work on test intelligence and its correlates. No radically new 
weapons of warfare are introduced. We read about twins, nursery school 
influence, and foster-home placement, as during the last ten or fifteen years. 

In a few scattered investigations increments in IQ have been associated 
with nursery school attendance. Others have denied that any such thing 
happened in their nursery schools. It is not clear why anyone should believe 
that the nursery school should per se have any special IQ-raising properties. 
It is at least understandable how such an institution might produce this 
effect under certain conditions, the most important probably being whether 
the nursery school represents a marked improvement over the kind of en- 
vironment the child has lived in up to the time of matriculation. The writer 
is not maintaining that such a contrast has enlarged or will enlarge the IQ; in fact, 
examination of the published research fails, on the whole, to give any great 
promise. The paper of Reymert and Hinton (86) represents a step in the 
right direction. The background of each of their cases (for the most part 
children of school age) prior to their entrance into “a relatively superior 
environment” was fairly well known. 

Of the eleven articles in the Thirty-Ninth Yearbook of the National 
Society for the Study of Education that present research bearing directly 
upon the effect of nursery school attendance on the IQ, seven—Anderson 
(1), Bird (6), Goodenough and Maurer (29), Jones and Jorgensen (45), 
Lamson (52), Olson and Hughes (76), and Voas (105)—gave negative 
results; four—those of Frandsen and Barlow (possibly) (26), Kephart 
(46), Starkweather and Roberts (99), and Wellman (107) gave positive 
results. (See also an article by Skeels and others (95).) This gave the 
status quo. It is not to be supposed that important psychological problems 
can be settled by simple addition. One experiment of the right kind may 
be more definitive than a dozen poor ones. Research on the effect of 
nursery school education upon the IQ cannot be expected to contribute 
much to the nature-nurture issue except in instances in which tenure 
in such a school represents a wide, definable, and clearly known discrep- 
ancy in educational opportunity as between the home and the school. 

The papers of Skeels (93, 94) and Skodak (96, 97) presented a promis- 
ing procedure in studying the effects of foster-home placement. About 150 
infants of illegitimate parentage and under six months of age were placed 
in superior adoptive homes. Periodic mental examinations, the last one 
made when they had attained the age of approximately four years, indi- 
cated a level of performance such as would be expected of children born 
in homes of similar status. About half the true mothers were examined 
and found to test low, a condition normally to be expected. Hasty gen- 
eralization is ill advised. However, these are the most favorable investi- 
gable conditions that have as yet been found; here, if anywhere, we should 
expect a good environment to make itself felt. Experience may be expected 
to refine the details of the procedure. As future experimentation goes on, 
this is a kind of approach that should be watched. The reader’s attention 
is called also to an article by Wells and Arthur (108) on the effect of 
foster-home placement. 

This discussion closes with a quotation from Carter (13) as he sum- 
marized his article: “The whole array of twin-studies seems to suggest, to 
the writer at least, the futility and artificiality of the idea of untangling 
nature and nurture influences in the sense of ascertaining the percentage 
contributions of each in any general sense.” 


Intelligence Test Performance and Academic Achievement 


Bruce (12) obtained a correlation of .31 between the IER Intelligence 
Scale CAVD, Levels M, N, O, P, and Q, and point-hour ratios earned by 
440 master’s degree students at Colorado State College of Education. The 
mean point-hour ratio of the lowest 25 percent on the CAVD was 3.72; 
that of the highest 25 percent, 4.25. This was found to be a statistically 
reliable difference, although there may be some question about the legiti- 
macy of using PE and SE formulae in tests of significance of differences 
between truncated distributions. Of 589 freshmen engineering students, 63 
percent of those placed in deciles 8-10 by the American Council Psycho- 
logical Examination received a letter grade of A; 10 percent of those 
placed in deciles 1-3 received this grade, as reported by McGehee (62). 
Within the years 1931-1934 Wesleyan University freshmen were given vari- 
ous scholastic aptitude tests. Letter ratings from A (top 10 percent) to 
E (bottom 10 percent) were assigned to the scores. The following pro- 
portions of the original groups were graduated at the end of a four-year 
period: A, 82 percent; B, 83 percent; C, 60 percent; D, 39 percent; E, 
31 percent. Of the A group 3 out of 5 men were graduated with honors; 
1 out of 20 of the E group attained like distinction. These data were re- 
ported by Langlie (55). In addition, the reader’s attention is called to 
investigations by Kirkpatrick (48) and Rigg (87). 


Prediction of School Success 


The following studies are pointed especially toward prediction of school 
success, mostly college. Those cited in the preceding section are, of course, 
pertinent to this topic. A comparison has been made by Prescott and 
Garretson (82) between an intelligence test and teachers’ ratings, with re- 
spect to correlation with first semester college grades. The teachers’ esti- 
mates correlated .63 with grades; the intelligence test, .32. Keys (47) 
obtained a coefficient of .35 between Terman Group IQ’s earned by students 
in junior high school and their eventual grade-point averages in college. 
Those of the group who entered college had earned an average IQ of 
115.5; those who graduated, 118; those who graduated with honor, 125. 

Leaf (56) reported a correlation of .57 between average college fresh- 
men marks and the American Council Psychological Examination, for 97 
students; a correlation of .56 between marks and the Iowa High School Con- 
tent Examination scores; and of .74 between marks and high-school marks. 
From a regression equation he predicted college marks and obtained an 
r of .77 between the predicted marks and actual marks. A coefficient of 
multiple correlation of .79 was reported. Attention is also called to the 
articles by: Charters (17), Langlie (55), Manning (69), Nemzek and 
DeHeus (75), Quaid (84), Read (85), and Traxler (104). 

With respect to adult education Brody (9) examined the claims that 
have been made by certain writers to the effect that educational achievement 
is more highly correlated with test intelligence than with the number of 
years of previous formal schooling. He maintained that a revaluation of the 
data tends toward the conclusion that the two factors are equally related 
to adult educational achievement. 

Two simple facts should be kept in mind in connection with the at- 
tempts to predict college marks from the intelligence test scores of high- 
school pupils or from any other such data. One is the unreliability of 
the college marks. The other is that college students are drawn very 
largely from the upper half of the high-school population, thus eliminating 
many potential failures—and also reducing the coefficients of correlation 
which are obtained. 
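
The two effects just named can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reanalysis of any study cited here: the underlying correlation of .60, the mark reliability of .70, the sample size, and all names in the code are hypothetical values chosen only to show the direction of the effects.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(true_r=0.60, mark_reliability=0.70, n=20000, seed=0):
    """Return test-mark correlations under three hypothetical conditions."""
    rng = random.Random(seed)
    # Error variance chosen so that var(true) / var(observed) = reliability.
    noise_sd = math.sqrt((1 - mark_reliability) / mark_reliability)
    tests, true_marks, observed_marks = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)  # test score in standard units
        y = true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        tests.append(x)
        true_marks.append(y)
        observed_marks.append(y + noise_sd * rng.gauss(0, 1))

    # Keep only pupils above the median test score ("upper half").
    cutoff = sorted(tests)[n // 2]
    kept = [(x, m) for x, m in zip(tests, observed_marks) if x > cutoff]
    kept_tests = [x for x, _ in kept]
    kept_marks = [m for _, m in kept]
    return {
        "all pupils, error-free marks": pearson_r(tests, true_marks),
        "all pupils, unreliable marks": pearson_r(tests, observed_marks),
        "upper half only, unreliable marks": pearson_r(kept_tests, kept_marks),
    }


results = simulate()
for label, r in results.items():
    print(f"{label}: r = {r:.2f}")
```

Under these assumptions the coefficient falls first when error is added to the marks and again when the group is cut to the upper half, although the underlying relationship is unchanged throughout.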


Relations between Test Intelligence and Various Factors 


Photographs—Cook (21) obtained coefficients of correlation ranging 
from —.06 to +.20 for various judges (personnel managers and social 
workers), between intelligence test performance and estimates of intelli- 
gence from photographs. The photographs were uniform in quality; the 
subjects, college men (freshmen). 

Knowledge—A study of the relationship between test performance and 
knowledge of world affairs of high-school pupils was reported by Blair 
(7). An r of .58 is given. The relationship did not change appreciably when 
fathers’ occupations were held constant. Kohn (49) obtained coefficients 
of from .54 to .82 between scores on various intelligence tests and scores 
on the Sones-Harry High School Achievement Test. A correlation was re- 
ported by Inman (41) between Otis MA’s and scores on a general knowl- 
edge test of .45, and of .41 between schoolwork and general knowledge. 
Watson (106) has studied the relationship between intelligence test scores 
and retention of course material. 

Musical talent—Correlations of from .12 to .26 between test intelligence 
and achievement on the Seashore Test of Musical Talent for 1,541 pupils, 
Grades V to XII, and of from .09 to .23 between Stanford Achievement 
and the Seashore scores were obtained by Ross (90). 

Improvability in reading—McCullough (61) found no relationship be- 
tween test scores and response to remedial instruction at the high-school 
and freshman college level, pursuant to a 10-week instructional program. 
Twenty-four high-school and forty-nine college students were studied. It 
is a question whether any significant improvement in reading can be 
effected in so short a period. Moderate gains, however, were indicated here; 
moreover, the results do not seem to be due to the regression effect as do 
those of some investigations in work of this kind. 


Physiological Correlates of Intelligence 


Metabolic rate—For two hundred children, ages six to fifteen, Hinton 
(35) obtained a correlation of .71 between basal metabolic rate and 
Binet IQ’s, and of .74 between the same variable and Arthur Point-Per- 
formance Scale IQ’s. These findings corroborate those of a previous inves- 
tigation conducted by the same author in which for 90 children coefficients 
of .74 and .66 respectively were obtained between basal metabolic rate and 
the same two intelligence tests (36). The first-named article gives separate 
correlations for each yearly age group. A marked and, to some extent, 
progressive decline in the magnitude of the coefficients begins at age ten. 
The author stated that this finding is in agreement with the fact that for 
adults there is no connection between basal metabolic rate and test intelli- 
gence. Investigations by Shock (92) and others have yielded much lower 
correlations than those reported by Hinton, being of the magnitude of .20 
to .30. 

Diabetic condition—A review of the literature and an examination of 
six cases by herself led Teagarden (103) to conclude that diabetic children 
give no evidence of intellectual superiority. On the contrary, there is 
some evidence of inferiority on their part. This observation is in line with 
G. D. Brown’s findings (11). The latter found that diabetic children 
tested slightly below their siblings, although the sampling was too small 
to be dependable. Some of the early work had given evidence of some 
superiority upon the part of diabetics. 

Effect of drugs—Molitch and Eccles (70) obtained gains of 7, 12, and 
15 percent in certain mental tasks subsequent to the administration of 
10, 20, and 30 mg’s of benzedrine sulfate respectively to three small 
groups of boys. Three matched control groups gained 3, 8, and 13 percent 
pursuant to the administration of an indifferent drug. The smallness of the 
net gain of the experimental groups over the control groups gives little 
support to the claim that the drug has a psychologically stimulating effect. 
This finding is in accord with results of McNamara and Miller (65). A 
stimulating effect of cobra venom upon both speed and accuracy of intel- 
lectual performance has been obtained, in the case of 25 college students, 
by Macht and Macht (63). They also found that morphine, codein, dilau- 
did, and heroin reduced the speed of mental processes. 

Allergic condition—Chobat and others (18) in studying 169 allergic 
children found no association between this condition and test intelligence— 
which, they assert, is contrary to popular prediction. 

Rachitic children—Halleran (30) found a reliable difference between 
rachitic and nonrachitic children in rate of mental development, the 
advantage being in favor of the latter. In the verbal elements in the tests 
no differences were observed. The author suggested that the retardation 
noted is probably the result of significant motor retardation. 

Brain extirpation—Hebb (32) administered the Stanford-Binet and 2 
performance tests to four subjects who had undergone frontal lobectomy, 
the levy probably amounting to from 4 to 10 percent of the total cerebral 
mass. Three tested above normal, one somewhat below normal, after re- 
covery. In one case in which pre- and post-operative tests were adminis- 
tered, no important changes in test performance were observed. The results 
are taken to signify that the effect on test intelligence of frontal lobectomy, 
of the amounts involved, is at least not great. This observation is in accord 
with other findings, for example, those of Jefferson (42). A brief review 
of the literature is found in a recent article by Erickson (23). 

Electroencephalography—It is generally observed that the frequency 
of the alpha waves and the percent of time they are present increase with 
chronological age up to about the tenth year, at which age they assume 
adult characteristics. In studying certain types of feeble-minded adults, 
ranging in mental age from 1.5 to 7.5 years, Kreezer (50) obtained 
evidence of a correlation between Stanford-Binet MA’s and electroenceph- 
alographic phenomena. The magnitude of the r’s varied with different 
classifications of feeble-mindedness. The latter could well be accidental 
owing to the smallness of the samplings, the smallest group containing 13 
cases, the largest 50. Lindsley (57) did not fully confirm Kreezer’s gen- 
eral findings and is inclined to believe that the fact of a correlation between 
EEG phenomena and MA, CA being constant, has not been fully established. 

Sex differences—A critical review of the literature on sex differences led 
its authors, Kuznets and McNemar (51), to conclude that the earlier and 
not overly well-founded predictions with respect to the absence of im- 
portant sex differences in general intellectual status have been vindicated. 
No significant differences are indicated either in general average or in 
variability. It may be said, however, that in some instances test items which 
show a differential appear to have been eliminated in the construction of 
batteries. The possibility of the presence of important sex differences in 
performance on certain kinds of items has not been eliminated. Addi- 
tional reference is made to articles by Livesay (58), Rigg (88), and 
Rusk (91). In the majority of the investigations boys are found to be 
slightly more variable, though the differences are in most instances 
unreliable. 

Nutrition—A significant gain—an average of 10 IQ points—is reported 
by Poull (81) in a group of 41 children pursuant to an improvement in 
nutritional status. A control group, well nourished at the beginning and 
at the end of the experimental period, matched with the experimental 
group in CA and IQ made no average change. The nutritional status of 
both groups was determined by physicians’ examinations and hospital rec- 
ords. It is suggested, tentatively, that a period of from eighteen to twenty- 
four months is required to effect a significant change. 


The Deaf and Hard of Hearing 


Adequate appraisal of the intelligence of the deaf is as far from fulfil- 
ment as ever. Language tests are regarded as being very unsatisfactory 
and performance tests have for the most part correlated indifferently with 
the usual criteria of intelligence. Language tests are probably not entirely 
unsatisfactory. At least they make possible an appraisal of a subject’s 
ability to utilize language in intellectual situations; and it is difficult to 
see how any high order of intellection can go on in the absence of language. 
If one has as his goal the measurement of the innate endowment of the 
deaf, language tests are, of course, quite inadequate. It is a mistake to 
assume that achievement on a language test gives a true picture of the 
whole mental development of the deaf but it is equally erroneous to assume 
that a performance test score of a deaf subject gives an adequate appraisal 
of his mental ability. Even though he may achieve normally on a per- 
formance test we know at once that he is handicapped in all the higher 
mental processes because of his lack of facility in language. It is possible 
that a level of achievement on a performance test has different predictive 
value for deaf and normal hearing children. In some respects it is prob- 
ably less meaningful in the case of the deaf. 

Bridgman (8) wrote: “So far, we have not found one deaf child who, 
having failed badly on a scale of nonverbal tests, was able to make even 
fair progress in his schoolwork. On the other hand, there was a consider- 
able portion of deaf children in the group tested (17 out of 83), who 
showed normal and at times very superior ability on the nonverbal scales, 
but whose success in school subjects was no better than that of frankly 
mentally deficient children.” It is true, however, that other complicating 
factors, such as poor home background and behavior difficulties, were 
present. Zeckel and van der Kolk (114) found a group of one hundred 
congenitally deaf children to rank below a group of one hundred hearing 
children, matched in socio-economic status with the deaf, by about 10 
Porteus Maze IQ points. In efforts to appraise the mental ability of deaf 
children it would be desirable also to compare the performance of deaf 
children with that of their siblings. 

A comparison was made by Pintner and Lev (80) of the performance 
of 1,556 children having normal hearing with that of 1,404 hard-of- 
hearing children on verbal and nonverbal tests. The mean verbal test IQ 
of the normal hearing children was 101, that of the hard-of-hearing, 95, 
and that of the very hard-of-hearing, 92. On nonverbal tests the IQ’s were 
102, 99, and 99, respectively. Springer (98) obtained no significant differ- 
ences between the scores of 330 deaf and 330 hearing children (ages six to 
twelve years) on the Goodenough test (drawing a man). The reader’s at- 
tention is called also to an article by Lane (53). 



CHAPTER IV 


Measurement of Aptitudes in Specific Fields¹ 


DAVID SEGEL 


In this discussion the word “aptitude” is used in two ways—one as a 
measure of a special ability (for example, the measurement of visual 
acuity), and the other as a prognostic measure (for example, the use of 
a special test to forecast success in an occupation). The chapter will not 
cover all measures of these types since educational tests are dealt with in 
other issues of the Review and another chapter of this issue covers certain 
applications of intelligence tests. 


General Treatises of Vocational Aptitude 


O’Rourke summarized the literature on vocational aptitudes from 1935 
to 1937 in the June 1938 issue of the Review (88). Since that time several 
general evaluations and discussions have appeared. Most important was 
the one by Stead and others (109) which discussed research methods in 
occupational prognosis and gave results in the testing of aptitude for 
several specific occupations. The study described the technical activities 
of the Work Analysis Section of the U. S. Employment Service, which 
developed measures of proficiency for people already in occupations. 
Oral trade questions in 126 occupations have been developed. 

One limitation in the use of the results of these studies is that the tests 
have been ratified for specific occupations rather than for individuals 
seeking guidance, and the test batteries are therefore employment tests 
rather than general guidance tests. An employment test is one given by a 
prospective employer to see if a specific applicant has the requisite ability. 
A guidance test is a test which will help the individual determine what 
type of occupation he is best fitted to enter. An inspection test of the type 
described by C. A. Drake (26) is a good prognostic test for inspectors of 
metal piecework but it has no known value as a guidance test. A mechanical 
aptitude test on the other hand has some guidance value but is not a specific 
employment test. Some tests may be useful for both purposes; for example, a 
manual dexterity test is both a guidance test and a specific employment test. 
Many of the tests described in the volume by Stead and others may have 
guidance value, but until they are tried out in a more general situation their 
guidance value is not known. A second limitation to this study is that the 
persons taking the tests were already in the occupations when tested, and 
some of the traits which they possessed may have been obtained on the job. 
Not until it can be shown that the traits are needed by beginning workers can 
the presence or absence of the trait be used as a good guide for employment. 


1 Bibliography for this chapter begins on page 52. 


February 1941 MEASUREMENT OF APTITUDES IN SPECIFIC FIELDS 

The interpretation of validity coefficients for guidance measures was dis- 
cussed by Taylor and Russell (114). They pointed out that the validity 
of a score for prognosis will vary with the critical score and the percent 
of persons usually successful on entering an occupation. Bell (6) reviewed 
the relationship between adjustment scores in college and occupational 
intentions, while Bingham (10) pointed out the need for the accumulation 
of research data over a period of years for good prognosis but cautioned 
against believing that interests and abilities are constant. General expositions 
of the use of measures for determining special aptitudes as well as more 
general aptitudes are those by Ruch and Segel (95), and Paterson, 
Schneidler, and Williamson (90). 
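
The dependence of prognostic value on the critical score and on the percent usually successful can be sketched numerically. The code below is an illustration under assumptions not stated in the text: test and criterion are taken to be bivariate normal, "success" is taken to mean exceeding a fixed criterion cutoff, and the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def success_rate_among_selected(validity, selection_ratio, base_rate, steps=4000):
    """Proportion of selected persons who succeed, under a bivariate-normal model.

    validity        -- correlation between test and criterion (absolute value below 1)
    selection_ratio -- proportion of applicants at or above the critical score
    base_rate       -- proportion who would succeed if no test were used
    """
    x_cut = STD.inv_cdf(1 - selection_ratio)  # critical test score
    y_cut = STD.inv_cdf(1 - base_rate)        # criterion cutoff for success
    resid_sd = math.sqrt(1 - validity ** 2)

    # Trapezoidal integration over test scores above the critical score.
    upper = 8.0
    h = (upper - x_cut) / steps
    total = 0.0
    for i in range(steps + 1):
        x = x_cut + i * h
        p_success = 1 - STD.cdf((y_cut - validity * x) / resid_sd)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * STD.pdf(x) * p_success
    return total * h / selection_ratio


# Same validity coefficient, different critical scores and base rates.
for sr in (0.7, 0.3, 0.1):
    for br in (0.3, 0.5, 0.8):
        rate = success_rate_among_selected(0.5, sr, br)
        print(f"selection ratio {sr:.1f}, base rate {br:.1f}: {rate:.2f}")
```

With zero validity the function returns the base rate itself, so the gain over the base rate is what the test contributes. Under this model, the same coefficient yields quite different gains as the two proportions are varied.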


Aptitude for Specific Academic Fields 


The elementary field will be omitted because at that level, except for 
specialized tests for beginning school children, the best aptitude indicators 
are the achievements in the individual fields. 

Several investigations of the aptitude for algebra (15, 62, 79, 99) 
compared the value of general intelligence, arithmetic achievement, reading 
achievement, and the following special aptitude tests: Iowa Aptitude Test 
for Mathematics, Orleans Aptitude Test for Algebra, and the Lee Test of 
Algebraic Ability. The conclusion reached was that the special aptitude 
tests are best, the arithmetic tests next best, closely followed by the intelli- 
gence tests. The other measures are distinctly less favorable as algebraic 
predictors. The results of these studies confirmed previous work. 

Traxler (117) and Segel and Proffitt (103) made extensive investi- 
gations to establish that marks in different subjects in high school and 
college have differential and direct predictive value which is substantial. 
Stuit and Donnelly (112) found that for a differential prognosis of success 
for most of the academic subject groups in college, the following tests could 
be used: Iowa High School Content Examination, the Iowa Mathematics 
Aptitude and Training Tests, and the Iowa Silent Reading Tests. These 
results also are in accord with previous studies. Working with the College 
Board Scholastic Test, Dickter (21) found the mathematical parts much 
more predictive of college success in mathematics than was the verbal 
part. Greene (48) analyzed the prognostic value of mental tests, vocabu- 
lary tests, interests, and previous training for 458 students in a psychology 
course. He found that by using several factors a fairly satisfactory prog- 
nostic value could be obtained. 

The Detroit General Aptitudes Examination (3) was developed for the 
first two years of high school to differentiate between aptitudes for general 
academic work, industrial arts, and various manual trade subjects and 
clerical course subjects. The materials of this test consist of motor per- 
formance, mechanical information, visual imagery, verbalization, and 
achievement in arithmetic, reading, spelling, and handwriting. 

Interest indicators afford an indirect means of indexing aptitude for 
various academic subjects. Research on the relationship of interest scores 
to specific school courses has been mainly a development of the period 
here reviewed. The Michigan Vocabulary Profile Test (47) gives an 
indirect indication of interest in the following different types of school 
courses: human relations, commerce, government, physical sciences, 
biological sciences, mathematics, fine arts, and sports. A study (49) 
of the profiles on this test for students who had chosen distinct occupa- 
tional courses showed that the test distinguished between graduate nurses, 
engineering students, and first-year medical students, but not between 
business administration seniors and graduate students in education. 

The Vocational Inventory developed by Gentry (45) covered vocational 
areas and used information rather than vocabulary. The areas covered 
are: social service, literary pursuits, law and government, business, artistic 
pursuits, mechanical designing, mechanical construction, and scientific 
pursuits. The value of the inventory in determining aptitude for college 
courses was supported by Gentry. Another measure of interest constructed 
by Kuder (63) can be used for both educational and vocational guidance. 
Its categories of interests are: scientific, computational, musical, artistic, 
literary, social service, persuasive. The items were made up from activities 
in which students might conceivably sometime engage, and are direct 
interest items instead of informational. The Dunlap Preference Blank was 
studied by Sharkey and Dunlap (106). It was found valid as an indicator 
of success in several different school subjects. Congdon (16) found that 
the Cleeton Vocational Interest Inventory was of value in differentiating 
between students of education and others. Duffy and Crissy (32) made a 
study of the relation of the evaluative attitudes—economic, political, 
theoretical, esthetic, social, and religious—to vocational interests and 
found significant trends between the two. 


Musical Ability 


The Seashore Tests of Music have been revised (101). Included as 
before are pitch, loudness, time, rhythm, and tonal memory, but a new 
test of timbre replaces the old consonance test. Semeonoff (104) experi- 
mented with the measurement of the appreciation of music through having 
students listen to ten phonograph records and select the best interpretation 
out of four alternates. The reliability of the tests was found to be high. 
Another method of appraisal of musical aptitude at the high-school level 
was developed in the high school of the University of California (70). 
Ten each of the following were used: rhythmic patterns, melodic pat- 
terns, dissonant and consonant chords, and differentiation in pitch and 
intensity. 

Seashore (100) discussed musical theories and (102) outlined the rules 
for the construction of a sight singing scale. Tests of ability to sing simple 
phrases were found to be more discriminative at preschool ages and 
reflected progress better than singing single notes or two-note intervals, 
according to a study made by Updegraff, Heiliger, and Learned (119). 
Murrell’s criticism of the validity of music tests (82) has in turn been 
attacked by Kwalwasser (64). A study of the changes in harmonic sen- 
sitivity in children over a period of years by Farnsworth (38) showed 
that with increasing musical sophistication there was a tendency to brand 
fewer and fewer musical combinations as bad. 

Several analyses of the scores on musical tests by themselves and in 
conjunction with scores on other tests have been made in order to find the 
fundamental musical traits. R. M. Drake (29, 30) made factorial analyses 
of the scores on the Test of Musical Memory and Retentivity, Kwalwasser’s 
Test of Tonal Memory, and Seashore’s measures of pitch, rhythm, inten- 
sity, time, and tonal memory. More than one common factor was found. 
It was suggested that the major one was “memory for auditory items.” 
The five Seashore tests were not found to satisfy the criterion of division 
into independent measurements of isolated capacities. The relation between 
the intelligence test scores and scores on certain of the Drake Music Tests, 
the Seashore battery, the Lowery measure of cadence, and the Kwalwasser- 
Dykema Tests of Melodic Taste and Tonal Movement was found to be low 
by R. M. Drake (31). Morrow (80) found that scores obtained on tests of 
musical, artistic, and mechanical abilities did not show clear differentiations 
between the three fields. Kwalwasser (65) concluded that the correlation 
between intelligence and musical ability is low because of the regression 
effect. Larson (67) found a correlation of .59 between Seashore scores 
made in high school and later marks in a music school. 


Art Aptitude 


Varnum (121) developed an art aptitude test having exercises on color 
memory, tone graduations, proportioning, static balance, rhythmic balance, 
feeling for geometric form, and creative imagination. The subsections 
correlated from —.15 to +.31 with each other and from .18 to .62 with 
the total score; the reliability of the total test ran as high as .88 on one 
group. The test was validated through scores made by art students and 
persons in related occupations. This test is one of the most important apti- 
tude tests developed recently. The McAdory Art Test was revised for use 
with such racial groups as the American Indian by Steggerda and Ma- 
comber (110). It failed in this purpose because the racial group tested (the 
American Indian) based its judgment on the utilitarian rather than 
the artistic characteristics of the objects and situations pictured. The 
reliability of the Knauber Art Test was questioned originally, but Moore 
(78) found a reliability of .90 for it in a group of 158 college students 
and art teachers. Burt’s picture test for artistic appreciation was found 
by Dewar (20) to have the highest reliability and validity among several 
art tests. 
Several studies were made dealing with the search for independent art 
traits. The most important of these is probably the over-all study and 
review made by Meier (72). He came to the conclusion that artistic apti- 
tude rests upon the possession of six factors: manual skill or craftsman 
ability, energy output and perseverance in its discharge, general and 
esthetic intelligence, perceptual facility, creative imagination, and esthetic 
judgment. Other studies made in this area are those by Eysenck (37) who 
found some evidence for a general factor of visual esthetic appreciation; 
and Dewar (20) who made a factor analysis of Burt’s picture test for artis- 
tic appreciation and found indications of a general artistic aptitude and 
suggestions of specifics. Lark-Horovitz (66) worked on the type of pictures 
preferred by boys and girls at different ages, while Miller (74) tried out 
intelligence, personality, and other measures on art students at the high- 
school level and found none of them to be related to dramatic ability. 


Visual Acuity and Auditory Testing 


The controversy over the best methods of testing the sense of sight has 
continued. The interpretation of the research results is complicated by cer- 
tain factors, one being that eye deficiency (so-called) is not always a 
liability. For example, myopia proves advantageous to persons doing 
certain kinds of work and possibly in facilitating reading. Studies of the 
relation of vision to efficiency in various school, occupational, and recrea- 
tional activities will no doubt reveal factors which will eventually differen- 
tiate among the procedures in testing vision in accordance with the objec- 
tives in view. 

Eames (34) found that his test of acuity of vision was reliable and valid. 
His test was 95 percent correct when judged by an oculist’s findings. Norms 
for various groups of students from kindergarten to college have been 
gathered by Betts (8) in the three efficiency slides designed for use in the 
Betts Telebinocular series. He reported many investigations of the use of 
the Betts Binocular Tests for vision in general and for specific purposes 
at varying age and grade levels. Hildreth and Axelson (53) adapted the 
Snellen E chart so as to make the testing a game for young children. At the 
college level, Frazer, Ogden, and Robinson (43) found that the Betts Tests 
of Binocular Skill were not reliable enough for diagnostic work. Molish and 
Reese (76) gave the Betts Test of visual efficiency and sharpness of image 
to 69 college students and then tested them in an optometric clinic. In sev- 
eral cases those failing to pass the Betts tests showed acuity of 20/20 on 
the Snellen test letters, while others who passed showed less than 20/20. 

On the elementary-school level, Oak and Sloane (85) concluded from 
using the Betts Visual Sensation and Perception Tests and checking with 
an ophthalmologist that the Betts tests sorted out far too many children for 
ocular attention while at the same time they missed some children needing 
attention. 
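
The two complaints made against the screening tests, too many pupils referred and some needy pupils missed, are separate kinds of error, and a single agreement figure can hide both. The following is a minimal sketch of how they might be tallied from a fourfold table. The class name, the field names, and all of the counts are hypothetical and are not taken from any of the studies cited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningTable:
    """Fourfold table: school screening result vs. specialist's finding."""
    referred_defective: int  # screen referred, specialist confirmed a defect
    referred_normal: int     # screen referred, specialist found no defect
    passed_defective: int    # screen passed, specialist found a defect
    passed_normal: int       # screen passed, specialist found no defect

    def over_referral_rate(self) -> float:
        """Share of referred pupils whom the specialist found normal."""
        referred = self.referred_defective + self.referred_normal
        if referred == 0:
            raise ValueError("no pupils were referred")
        return self.referred_normal / referred

    def miss_rate(self) -> float:
        """Share of pupils with a defect whom the screen passed."""
        defective = self.referred_defective + self.passed_defective
        if defective == 0:
            raise ValueError("no pupils had a confirmed defect")
        return self.passed_defective / defective

    def agreement(self) -> float:
        """Share of all pupils on whom screen and specialist agreed."""
        total = (self.referred_defective + self.referred_normal
                 + self.passed_defective + self.passed_normal)
        if total == 0:
            raise ValueError("the table is empty")
        return (self.referred_defective + self.passed_normal) / total


# Hypothetical counts for 200 pupils, invented for illustration only.
example = ScreeningTable(
    referred_defective=30,
    referred_normal=40,
    passed_defective=10,
    passed_normal=120,
)
```

With these invented counts the screen and the specialist agree on three pupils in four, yet more than half of the referrals are unnecessary and one defective pupil in four is missed. A percent-correct figure alone would therefore not settle the questions raised by the studies above.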
Oak (84) compared one hundred pupils less than sixteen years old, 
handicapped in reading, with one hundred pupils of the same age, show- 
ing no handicap in reading, on the results of testing with the Betts cards 
(D B Series) and on examinations made by an ophthalmologist. He con- 
cluded from his findings that the Betts cards do not serve to screen out 
children who should be referred to an eye specialist. English and others 
(35) tested 485 third-grade children with the Betts telebinocular method 
and the method described in the report of the joint committee of the N.E.A. 
and A.M.A. (83). The results indicated that the latter method was superior 
to the former. The higher error in the Betts method was attributed to 
failure to detect myopics. The relation of the measures obtained from the 
Betts Tests of Visual Sensation and Perception and Ives Test of Acuity and 
Ametropia to reading speed for students at the college level was found by 
Stromberg (111) to be insignificant. The work of Hitz (55) and Schwartz 
(97) also indicated that the present instruments are inadequate for 
the testing of vision in the schools. The special testing 
of the change or adjustment of sight for changes of intensity of light and 
ability to change from looking at near objects to distant objects and vice 
versa, such as is required of airplane pilots, has been described by Ferree 
and Rand (39, 40). 

A comparison of tests of color blindness was made by Philip (91). He 
found correlations of from .50 to .90 with 42 cases of defective color vision 
between the following tests: Ishihara, Edridge-Green, Nagel, Holmgren 
wools, Philip color perception, and the new edition of the Ishihara. The 
Ishihara and the Philip tests had certain advantages. Philip (92) also 
established that errors in distinguishing colors were made more fre- 
quently by boys and men than by girls and women. 

Hearing—The efficiency of most of the several types of hearing tests in 
current use was tested and discussed by Holmgren (57) while Silverman 
(107) investigated thoroughly the 4-A and 2-A individual audiometer. 
The 4-A audiometer was found inadequate in that all the elements of 
English speech were not included and the 2-A individual audiometer was 
inadequate because it was not calibrated finely enough to insure a com- 
plete picture of the child’s hearing loss. An analysis of the World’s Fair 
hearing tests by Montgomery (77) verified the general findings of hearing 
in regard to age and sex. 


Mechanical and Manual Dexterity Tests 


In this section is included a discussion of studies of manual, motor, and 
physical aptitude, as well as those of mechanical aptitude. It is hoped that 
a new and better classification of these different traits will soon be at- 
tempted. As mental and physical traits are more clearly defined the tests 
will tend to fall into correct categories. 

Packard (89) found that mechanical aptitude in pupils of high-school 
age could be measured best by a combination of (a) intelligence, (b) aca- 
demic achievement, (c) grade level, (d) mechanical ingenuity, (e) coor- 
dination, (f) manipulation, (g) spatial perception, and (h) construction. 
A different approach to the same problem was offered by Harrell (51) who 
made a factorial analysis of mechanical ability tests including the Minne- 
sota battery of mechanical tests, the MacQuarrie Test of Mechanical Abil- 
ity, the O'Connor Wiggly Blocks, and the Stenquist picture-matching test. 
He found five factors present—perceptual, verbal, youth, manual agility, 
and spatial. More evidence as to the multi-trait nature of mechanical apti- 
tude was presented by Slater (108) who found that valid tests of this apti- 
tude were saturated with spatial relationships. Hayes and Drake (52) found 
no relation between results of the MacCauley Tetrahedron Test and ability 
in descriptive geometry. 

The motor performance of 80 girls and 85 boys was followed over a 
period of years by Espenschade (36). Correlations between motor per- 
formances and all measures of physical growth and maturity were low for 
girls but the reverse was true for the boys. Intercorrelations among motor 
performances were all positive but varied in magnitude. O’Connor (86, 87) 
gathered further norms on his Block Cube and Finger and Tweezer Dex- 
terity Tests. Van Der Lugt (120) developed a series of tests for the study 
of motor functions consisting of speed of performance in (a) threading 
of beads, (b) punching holes in a sheet of paper, (c) pressure sense, (d) 
precision, and (e) motor memory. Other studies in which norms were 
developed for achievement scales in physical activities were those by 
Metheny (73), McCaskill and Wellman (69), Powell and Howe (93), 
Glassow, Colvin, and Schwarz (46), and by several Wellesley graduate 
students (115). The distribution of hand usage dextrality was further 
developed during this period by W. Johnson and others (60, 61). 


Manual Semi-Skilled and Skilled Trades 


C. A. Drake and Oleen (28) believed it possible to select with high 
efficiency employees for factory type work. Evidence confirming C. A. 
Drake and Oleen’s optimistic analysis is found in a study by Hiscock (54) 
in England; testing situations like the actual work were used to select 
workers. Similar results were found in Sweden by Anderberg and Wester- 
lund (2) who constructed a miniature weaving machine for use in secur- 
ing a measure of textile learning rate. C. A. Drake (26) used a twisted rod 
inspection test, a pin board test, and a paper and pencil design test with 
success in the selection of inspectors of factory work. Tiffin and Greenly 
(116), however, found that the Keystone Visual Safety Tests, a hand move- 
ment test, and the O’Connor Finger Dexterity Test had little validity in- 
dividually in testing aptitude for simple manual factory operations. A 
combination of the three produced a validity coefficient of .60. A maze test 
was found useful in discovering aptitude to follow electric wiring pat- 
terns (94). 
The foregoing tests are known as employment tests since they are 
directly related to a particular job. In an attempt to find tests which might 
be useful in the discovery of persons with an aptitude for larger areas of 
skill, Slater (108) and Holliday (56) found that ability in various space 
and form relationship tests was basic to aptitude in the mechanical trades. 
They found no evidence, however, of a special mechanical aptitude such 
as is given by scores on “mechanical aptitude tests.” They concluded that 
“mechanical aptitude tests” measure to some extent spatial relations and 
also general intelligence. 


Clerical Aptitudes 


The new Turse Shorthand Aptitude Test (118) was constructed from tests 
of manual dexterity, spelling, phonetic association, symbol transcription, 
word discrimination, dictation, and word sense. Hales (50) gave the 
Minnesota Vocational Test for Clerical Workers and Thurstone Examina- 
tion for Clerical Workers to 129 inmates of a Minnesota reformatory for 
men who were studying clerical subjects or doing clerical work. The cor- 
relations of the test results with the supervisors’ ratings were low—averag- 
ing about .35. Davidson (18) made an evaluation of the following clerical 
tests: Bureau Test VI, Thurstone Clerical Test, a modification of the Thur- 
stone Clerical Test, Minnesota Vocational Test for Clerical Workers, 
O’Rourke Clerical Aptitude Test—Junior Grade, O’Rourke Clerical Apti- 
tude Test—Senior Grade. He compared the results of the tests with 
supervisors’ estimates and with promotability as indicated by the level of 
job attained at the end of a given period of employment. The validity 
coefficients, in terms of supervisors’ ratings, ranged from .27 to .44. The 
validity coefficients, in terms of promotability, ranged from .07 on the 
number checking part of the Minnesota test to .77 on the O’Rourke Clerical 
Aptitude Test—Senior Grade. This kind of validity is limited by the fact 
that the testing was done on persons already working in the clerical field. 
The high standing of the O’Rourke Clerical Aptitude Test can more easily 
be attributed to the training received on the job than to inherent aptitude. 
Stead and others (109) reported fifty-one validity correlations for tests in 
a number of clerical occupations. The criterion of success in most cases 
was a direct production record rather than supervisors’ ratings. The 
coefficients ranged from .35 to .68, two-thirds of them being below .50. 
The limitation of the method involved in deriving these validity coefficients 
is the same as for those quoted for Davidson. 


Professional and Semi-Professional Pursuits 


Dwyer (33) made an analysis of 19 occupational scorings of the Strong 
Vocational Interest Test given to 418 students entering medical school and 
found that the scores yielded for four “key” occupations—physicist, journ- 
alist, minister, and life insurance salesman—explained most of the scores 
on the 19 original occupations. Regression equations using the scores for 
these four “key” occupations predicted scores for most occupations with 
multiple correlations of .80 or better. A study of the variation in types 
of ability in relation to the type of institutions was made by Bryan and 
Perl (14). They tested women students in the Pratt Institute (an art 
school), the Institute of Musical Arts (a school of music), and New Col- 
lege of Columbia (an undergraduate school of education), with the 
Bernreuter Personality Inventory scored for neurotic tendencies, a test of 
rote memory, a motor speed test, and the American Council Psychological 
Examination. The students in the three different colleges were significantly 
different in some of the traits. 

Dentistry—A battery of 34 items used for selecting dental students was 
described by Bellows (7). 

Education—Successful educators were found to be superior in intelli- 
gence by Shannon (105). However, Shannon studied administrators and 
supervisors with teachers, and judged success by promotability which is 
perhaps more of a measure of executive ability. Barr (4, 5) and Mathews 
(71) analyzed the relation of teachers’ scores on a large number of tests 
to changes in achievement in their pupils. None of the measures proved 
very significant. The two found to be of most value were the American 
Council Psychological Examination and Yeager’s Scale for Measuring 
Attitudes toward Teachers and the Teaching Profession. These studies 
like many others on the adult and college level suffer from attempting to 
interpret ability on the job as aptitude for the job. 

Engineering—Laycock and Hutcheon (68) gave 144 students of the 
freshman engineering class a number of tests during their freshman year. 
The results were correlated with the average grades obtained during that 
year. The following correlations were obtained: American Council on 
Education Psychological Examination, .34; last-year high-school grades, 
.61; Cox mechanical aptitude test (models), .16; a paper formboard test, 
.25; physical science interest (Thurstone), .26. By careful selection of these 
tests a multiple correlation of .66 with the criterion was obtained. 
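
How much a second test adds to the best single predictor depends on how closely the two predictors are related to each other. The sketch below applies the standard two-predictor formula for the multiple correlation. The validities .61 and .34 are the reported figures for high-school grades and the psychological examination; the intercorrelation of .40 between them is a hypothetical value, since the summary above does not give it, and the function name is likewise an assumption.

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("the three correlations are mutually inconsistent")
    return math.sqrt(r_squared)


# .61 and .34 are the reported validities; .40 is a hypothetical
# intercorrelation between school grades and the examination.
r_combined = multiple_r(0.61, 0.34, 0.40)
```

Under the assumed intercorrelation the combined correlation is about .62, barely above the .61 for grades alone. This is in keeping with the reported multiple correlation of .66 for the selected battery, which is only a modest gain over the single best predictor.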

Medicine—Most of the studies (12, 22, 24, 81) carried on with the Moss 
Medical Aptitude Test gave results supporting the view that it is superior 
to any other method in the selection of medical students. Marks in premedical 
education, however, have also been found of value. 

Nursing—Williamson, Stover, and Fiss (123) determined that a compre- 
hensive testing program consisting of (a) a college aptitude test (vocabu- 
lary), (b) the Moss Nursing Aptitude Test, (c) the Cooperative English 
Test (usage and spelling), and (d) a Cooperative General Science Test, 
was a fairly valid measure of aptitude for nursing. Williamson also made 
some pertinent observations regarding the effect of varying marking and 
rating systems in different nursing training schools upon the validity of 
coefficients. Garrison (44) studied the relation of the scores on the Bern- 
reuter Personality Inventory to ratings on student nursing practice. He 
also used the Otis Intelligence Test, the Detroit Mechanical Aptitude Test 
for Girls, and the Iowa Reading Test. Correlations of these measures with 
nursing practice ratings were .59, .37, and .55. 


Miscellaneous 


Biesheuvel (9) found that perseverators had a lower threshold for flicker 
than nonperseverators. In the ability to recognize faces, Howells (58) found 
some indication that women were superior to students and farm people 
and that fraternity people were superior to nonfraternity people. The 
validity of the Noll Test of Scientific Thinking was tested by Blair (11) 
by having recognized scientific authorities take the test. The results 
showed the validity of the test to be questionable. The use of the lie de- 
tector as an accurate measure was questioned by Forkosch (42) and Ruck- 
mick (96). The latter found in a series of investigations that the detector 
was only 83 percent correct. 

Salesmanship—A study of the factors making for success in sales work 
was made by Mitchell (75). He found that a vocabulary test, a word 
association test, a word series test (giving the names of as many things that 
begin with “s” as possible in one minute), and an ink blot test were of some 
significance. Wallace and Travers (122) also worked on this problem. 
They found that specialty salesmen were highly obsessional. It is, of 
course, another thing to say that such a trait is necessary before entering 
into employment. Dodge (23) concluded that social dominance is not 
associated with sales ability because he found a correlation of only .16 ± .16 
between this trait and success in selling. This conclusion is contrary to 
the previous work in this area. 
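
The second figure attached to Dodge's correlation reads most naturally as a probable error, although the summary above does not label it; that reading is an assumption here. Under it, the classical large-sample formula gives a rough idea of how few cases stand behind the coefficient. The function names below are illustrative only.

```python
import math

# A probable error is 0.6745 standard errors.
PE_CONSTANT = 0.6745


def probable_error_of_r(r: float, n: int) -> float:
    """Classical large-sample probable error of a correlation coefficient."""
    if n <= 0:
        raise ValueError("n must be positive")
    return PE_CONSTANT * (1.0 - r * r) / math.sqrt(n)


def implied_sample_size(r: float, pe: float) -> float:
    """Number of cases implied by a reported r and its probable error."""
    if pe <= 0.0:
        raise ValueError("the probable error must be positive")
    return (PE_CONSTANT * (1.0 - r * r) / pe) ** 2


def ratio_to_probable_error(r: float, pe: float) -> float:
    """How many probable errors the coefficient lies from zero."""
    if pe <= 0.0:
        raise ValueError("the probable error must be positive")
    return abs(r) / pe


# Reported values; treating the second as a probable error is an assumption.
implied_n = implied_sample_size(0.16, 0.16)
ratio = ratio_to_probable_error(0.16, 0.16)
```

On this reading the coefficient equals its own probable error and rests on something like seventeen cases. A common rule of thumb of the period asked for a coefficient about four times its probable error before it was trusted, so a result of this kind neither establishes nor rules out a relation between dominance and selling.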

Aviation and automobile driving—The Waltring Rotoscope, the Key- 
stone Tel-Eye-Trainer and Stereoscope, and the American Optical Master 
Model Stereo-Orthopter were used to advantage in the testing and training 
of aviators according to Schwichtenberg (98). Swope (113) prepared a 
test dealing exclusively with judgment factors in automobile driving. The 
selection of the items was made on the basis of opinions as to the value 
of the item obtained from commercial and noncommercial drivers and a 
cross section of university students. Allgaier (1) analyzed the results of 
15 tests administered to 22,000 drivers in 70 cities and found that the 
abilities required for safe driving were most highly developed between 
the ages of twenty and forty. Other testing programs for automobile drivers 
were described by C. W. Brown and by others (13, 19, 41, 59). C. A. 
Drake (27) concluded that accident-proneness is associated with the 
discrepancies between perception and motor reaction and that the dis- 
crepancies can be determined by tests. If Drake’s conclusion is sound it 
means that automobile drivers could be made aware of their weakness 
and industrial workers placed in jobs with due regard for their accident- 
proneness. 
CHAPTER V 


Current Construction and Evaluation of 
Personality and Character Tests¹ 


ARTHUR E. TRAXLER 


THIS CHAPTER is concerned with the construction and evaluation of tests 
of personality and character. Projective methods are reserved for the fol- 
lowing chapter, and survey studies appraising the personality and char- 
acteristics of various groups are reported in Chapter VII. The present 
chapter falls naturally into six divisions: (a) adjustment inventories and 
questionnaires for broad aspects of personality; (b) interest inventories and 
checklists; (c) investigations of attitudes and opinions; (d) measurement 
of persistence; (e) investigations concerned with rating scales; and (f) 
miscellaneous. Within the larger divisions, there are three kinds of studies: 
(a) those reporting the construction and validation of new instruments, 
(b) those appraising specific tests already available, and (c) those dealing 
with various questions concerning technics of testing. 

Recent summaries and bibliographies—In the REVIEW OF EDUCATIONAL 
RESEARCH for June 1938, Watson (144) presented a three-year summary 
and bibliography of personality and character measurement including 
references to more than three hundred investigations. Nelson (88) reviewed 
the literature on attitude measurement and gave a bibliography of 183 
titles. On the basis of an analysis of papers read at the 1938 convention 
of the American Psychological Association, Stagner (125) indicated 
trends in research upon character and personality. More recently, Bern- 
reuter (8) presented an overview of the nature and uses of personality tests. 
Traxler (136) described and evaluated the personality measures judged 
to be most common and most useful. The construction and use of some 
instruments was described by Koos (69). Bibliographies of personality 
tests are included in the general test bibliographies referred to in the first 
section of Chapter I. 


Personality Inventories 
New and Revised Tests Yielding Several Scores 


Link (78) published the 1938 revision of his PQ, or Personality 
Quotient Test. It is similar to the 1936 form, in that it yields an over-all 
score for personality and separate scores for social initiative, self-determi- 
nation, economic self-determination, and adjustment to the opposite sex. It 
also gives a PQ, or personality quotient. Odd-even reliabilities (1936 
edition) corrected by the Spearman-Brown formula are from .73 to .88. 
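
The Spearman-Brown correction applied to odd-even coefficients such as these can be sketched as follows. The correlation between the two half-tests is stepped up to estimate the reliability of the full-length test. This is a minimal illustration and is not drawn from any test manual; the function names and the sample half-test value of .60 are assumptions.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Predicted reliability of a test lengthened by `length_factor`.

    For split-half (odd-even) work the factor is 2: the two halves are
    correlated and the result is stepped up to the full length.
    """
    denominator = 1.0 + (length_factor - 1.0) * r_half
    if denominator <= 0.0:
        raise ValueError("correlation too low for this length factor")
    return length_factor * r_half / denominator


def half_test_correlation(r_full: float) -> float:
    """Invert the two-half formula: the raw odd-even r behind a corrected r."""
    if r_full >= 2.0:
        raise ValueError("r_full must be less than 2")
    return r_full / (2.0 - r_full)


# Hypothetical raw odd-even correlation of .60, stepped up to full length.
corrected = spearman_brown(0.60)

# Raw half-test correlations implied by the corrected range quoted above.
implied_low = half_test_correlation(0.73)
implied_high = half_test_correlation(0.88)
```

A raw half-test correlation of .60 becomes .75 after correction. Read in the other direction, corrected coefficients of .73 to .88 would correspond to raw odd-even correlations of roughly .57 to .79, which is worth keeping in mind when corrected split-half figures are compared with retest coefficients.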


1 Bibliography for this chapter begins on page 73. 
Cowan (21) made the second revision of his Adolescent Personality 
Schedule available and standardized it on about twelve hundred children, 
twelve to eighteen years of age. It is designed to measure maladjustment 
in nine fields. Bell (6) published the adult form of his Adjustment In- 
ventory. It gives scores for home adjustment, health adjustment, social 
adjustment, emotional adjustment, and occupational adjustment. The odd- 
even coefficients of reliability predicted with the Spearman-Brown formula 
range from .81 to .94. 

Thorpe, Clarke, and Tiegs (134) published the California Test of Per- 
sonality, which is planned to measure personal adjustment and social ad- 
justment of pupils in Grades IV-IX. There are several subtests within each 
main part. The results may be graphed in the form of a profile. The 
split-half reliability stepped up by the Spearman-Brown formula is .93 
for the whole scale. Pintner and others (98) also prepared a personality 
inventory for Grades IV-IX. The average retest reliability for four ad- 
ministrations of the test to one hundred pupils in Grade V is given as 
follows: ascendance-submission .71, introversion-extroversion .70, and emo- 
tionality .72. Pintner and Forlano (97) validated the test by the out- 
standing characteristics of the pupils as reported by their teachers and 
concluded that the technics afforded a rough measure of the validity of 
the test. 

On the basis of factor analysis studies, Guilford (47) developed an in- 
ventory for five factors which he called S, social introversion; T, thinking 
introversion; D, depression; C, cycloid tendencies; and R, rhathymia or 
happy-go-lucky disposition. Two sets of reliability correlations which had 
been corrected by the Spearman-Brown formula were relatively high. The 
lowest correlation coefficient was .84 and the highest was .94. Washburne 
(141) published his Social Adjustment Inventory for diagnosis in clinics 
and counseling in secondary schools and colleges. The scoring device 
divides the test into eight subtests. Retest reliability of the entire instrument 
after an interval of one semester was .92. 


Personality Measurement of the Survey Type 


Remmers, Whisler, and Duwald (101) described the construction of a 
personality test for the adolescent level. A test of child personality was 
prepared by Baxter (5) and standardized for Grades I-VIII. This is one 
of the very few personality tests that attempts to cover the entire range of 
grades in the elementary school. The split-half reliability predicted by the 
Spearman-Brown method was .92. A psychosomatic inventory divided into 
two parts according to physiological functions and psychological functions 
was described by McFarland and Seitz (81). Percentile norms for men and 
women are intended to differentiate normal from neurotic individuals. The 
reliability of the total score is reported as .87 in the manual of directions. 
On the basis of the qualities most frequently mentioned in the literature 
as contributing to an individual’s social proficiency, Jackson (64) devised 
and validated a test of social proficiency. The reliability coefficient stepped 
up by the Spearman-Brown formula was .86. The author stated that con- 
sideration for others seemed to be the foundation of social proficiency. 
Mitrano (86) described the preliminary work that had been done in de- 
vising a schedule to measure emotional stability in children and reported 
a reliability coefficient of .77 based on eighty-two cases. To investigate the 
emotional difficulties of students, Manuel, Adams, and White (85) pre- 
pared a blank consisting of thirty questions printed on one side of an 
answer sheet designed for the International Scoring Machine. The blank 
may be used in schools as well as colleges. The Spearman-Brown split- 
half reliability is about .77. Emme and Henry (30) designed an inventory 
to measure the affection or dislike of students for their parents and gave 
information about its validation. A manual for Louttit and Carter’s psycho- 
diagnostic blank was prepared by Carter (15). 


Studies of the Bernreuter Personality Inventory 


The already extensive bibliography relating to the Bernreuter inventory 
has been augmented considerably. Jarvie and Johns (66) found that the 
Bernreuter inventory offered little aid to the study of personality problems 
in the Rochester Athenaeum and Mechanics Institute. Stogdill and Thomas 
(126) found the Bernreuter Personality Inventory “very helpful in dis- 
criminating between well-adjusted and maladjusted students” in connection 
with a Student Psychological Consultation Service. Hathaway (54) showed 
the value of the Bernreuter inventory in diagnosing the adjustment diffi- 
culties of individuals classified as “constitutional psychopathic inferiors.” 
St. Clair and Seegers (114) supplemented an earlier study of the Bern- 
reuter scoring keys with a study of the validity of the Flanagan scoring keys. 
The data were obtained from 1,162 college students. The results showed 
considerable validity for the F-1 score as a measure of self-confidence. The 
authors indicated that the F-2 score was inconsistent as a measure of soci- 
ability and that the B-1, B-2, and B-4 scores provided a more refined 
analysis. 

Nemzek (89) investigated the value of the B1-N, B2-S, and B4-D scales 
of the Bernreuter Personality Inventory for the prediction of academic 
success of secondary-school pupils as measured by teachers’ marks. The 
scores on the scales were found to be of little value in predicting achieve- 
ment in the various school subjects. In studies of the stability of scores of 
college students on the Bernreuter inventory, Farnsworth (32) and Kirk- 
patrick (68) reported retest correlations of scores obtained at intervals 
of a year or more. For an interval of one year, Farnsworth found r’s rang- 
ing from .70 to .77, for two years from .56 to .74, and for three years 
from .44 to .72. In Kirkpatrick’s article, the correlations between scores 
at the beginning of the freshman year and the end of the sophomore year 
averaged about .7. These correlations indicate that whatever is measured 
by the Bernreuter inventory is fairly stable, although not exceptionally 
so. Farnsworth and Ferguson (34) reported the scores on two adminis- 
trations of the Bernreuter inventory to a college student who subsequently 
committed suicide. 


Studies of the Humm-Wadsworth Temperament Scale 


Rather detailed studies of the Humm-Wadsworth scale were reported 
by Kruger (71) and by Dysinger (28), and the conclusions of both studies 
were somewhat unfavorable to the scale. Kruger analyzed the intercorre- 
lations between the components of the scale on the basis of the scores 
of 437 men who consulted the Adult Guidance Service in Los Angeles 
and concluded that the intercorrelations for the components of tempera- 
ment, with the exception of normal, are not in accord with the theory 
that such identifiable syndromes exist. She recognized, however, that there 
were certain limitations in the composition of the group studied. Dysinger 
found that the data from 307 university students were different in several 
ways from those used in the original standardization of the scale. There 
seemed to be an undue concentration of scores at one extreme or the other 
in several of the components. However, the reliability of the scores was 
high and the low intercorrelations between the components indicated that 
the scale was actually probing various phases of personality. 

Humm (59, 60) published replies to both articles. He criticized Kruger’s 
analysis from the standpoint of sampling, criteria, and mathematical treat- 
ment. He pointed out that Dysinger had not taken into consideration all 
the statistical data that the authors had made available in the manual. He 
called attention to the recommendation that the scales be accepted or 
rejected on the basis of the proportion of no-responses. He also suggested 
that Dysinger’s correlation fields might be curvilinear. With respect to the 
point concerning the influence of no-responses on the data, Humm, Stor- 
ment, and Iorns (61) recently published regression equations for com- 
binations of scores which were intended to counteract the tendency of the 
component to vary unduly with a high proportion of no-responses. Hem- 
sath (55) described the use of the Humm-Wadsworth temperament scale 
in connection with the employees of a bank and presented and discussed 
six illustrative profiles. 


Studies of Other Personality Inventories 


Thomson (130) summarized and interpreted the scores of 259 high- 
school pupils at Mooseheart, Illinois, on the PQ, or Personality Quotient 
Test. Among the results, it was reported that pupils with high PQ’s have 
a slight advantage in academic competition and that there was no statistical 
evidence for the assumption that low PQ’s are associated with problem
behavior. Roslow (108) outlined a plan used in establishing the validity
of the PQ test by means of tests administered in fifty high schools and
colleges throughout the United States. A criterion of personality involving 
leadership and social cooperation was employed. Pedersen (94) investi- 
gated the validity of the Bell Adjustment Inventory on the basis of scores 
and ratings of 380 freshman women at the University of Rochester. Evi- 
dence of validity was found for the home adjustment, health adjustment, 
and social adjustment scales, but there seemed to be no difference between 
the emotional scores of those rated maladjusted emotionally and the 
other students. 

Harriman (50) reported a study of the predictive value of the Wood- 
worth-House Mental Hygiene Inventory on the basis of an analysis of the 
records of forty-seven college students whose test blanks indicated a large 
number of problems. The practical advantages of the inventory were ques- 
tioned because of the fact that the scores were predictive of subsequent 
maladjustment in not more than 30 percent of the cases. Wolf (153) com- 
pared the scores made on the Woodworth-Cady Personal Data Sheet and 
on Baker’s Telling What I Do Test by two groups of girls that were almost 
equal in intelligence but that differed greatly in academic achievement. 
The Woodworth-Cady questionnaire made statistically significant differ- 
entiation between the success and failure groups. In the case of the 
Telling What I Do Test, there was a strong probability of a true difference. 
Two articles on revisions of the A-S Reaction Study were published. 
Ruggles and Allport (109) reported a revision of the form for women 
and gave new data about the reliability, validity, and uses of the scale. 
The reliability through the application of the Spearman-Brown formula 
was found to be .90. Schultz and Roslow (116) described a restandardiza- 
tion of the business revision of the A-S Reaction Study. The authors pointed 
out that the previous use of the business revision had been limited because 
the scores were spread over a large range and tended to be distributed 
rectangularly. The restandardization yielded a distribution of scores that 
seemed to approximate the normal curve. 
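
The Spearman-Brown figure cited above can be illustrated with a short sketch. It assumes, purely for illustration, that the half-test correlation comes from an odd-even split of the items; the source does not describe the split actually used, and the function names and data layout are hypothetical.

```python
def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(r_part, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the part
    whose reliability is r_part (factor=2 is the split-half case)."""
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)


def split_half_reliability(item_scores):
    """item_scores: one list of item scores per examinee (assumed layout)."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))
```

Under these assumptions a half-test correlation of about .82 steps up to a full-length estimate of about .90.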

Several persons have recently reported studies in which two or more
personality inventories were administered in such a way as to make 
possible certain comparisons between them. Wasson (142) gave the All- 
port A-S Reaction Study, the Willoughby Emotional Maturity Scale, the 
Humm-Wadsworth Temperament Scale, and a Case-Study Questionnaire 
to ninety-three men students in a university and studied the interrelation- 
ships of the scores on the different scales. Farnsworth (33) administered 
the Willoughby Emotional Maturity Scale, the McNemar-Landis modifica- 
tion of this test, the Pressey Interest-Attitude Test, and the Landis Ques- 
tionary to groups of college sophomores and correlated the results. None 
of the intercorrelations were significant except a correlation coefficient of 
.46 between the original and the modified Willoughby scale. It appeared
that the tests were not measuring the same variable. J. Greene and Staton 
(45) administered the Willoughby scale together with the Bernreuter 
inventory, the Bell inventory, three tests for teaching aptitude, and four
supplementary measures including grades, intelligence-test scores, the 
Wrenn Study-Habits Inventory, and Sims’ Socio-Economic Status to one 
hundred students in the College of Education at the University of Georgia. 
It was found that only nine of thirty-six correlations between
teaching-aptitude measures and measures of emotionality and adjustment were
statistically reliable. 

Peters (95) studied the extent to which scores based on the Bernreuter,
the Bell, and the Link personality inventories agreed with the behavior of 
university freshmen as observed by others. “High” and “low” classes were 
determined for each of nine traits and one over-all trait, and a new technic 
for computing biserial r’s from widespread classes was used. The validity
correlations ranged from -.07 to +.50 and averaged about .26. Harris
and Dabelstein (51) studied the Maller Case Inventory and the Boynton
B.P.C. Personality Inventory on the basis of the scores of 421 pupils in 
Grades V to IX, inclusive. A factor analysis by the Thurstone method in- 
dicated that three general factors, or possibly four, would account for the 
relationships among the subtests of the Maller inventory, the keys of the 
Boynton inventory, and mental age and chronological age. 


Simplified Scoring of Personality Tests 


Certain personality inventories are scored with several different scales 
in which differential weights are applied to the various test items. Since 
this is often a time-consuming and laborious procedure, attempts to
simplify the scoring are naturally made from time to time. For instance,
Bennett (7) reduced the weights of the items in the two Flanagan scales 
of the Bernreuter inventory to zero, one, and two instead of the regular 
range from minus seven to plus seven, and rescored 115 inventories with 
the simplified scales. The scoring with the simplified scales correlated 
with the original scoring to the extent of .97 for the F1-C scale and .98
for the F2-S scale. New regression equations for determining the Bern- 
reuter scores were also prepared and the results studied. A short scoring 
method was worked out for the Link PQ test by Gibbons (43) but this was 
accomplished by mechanical procedures rather than by making a funda- 
mental change in the original scoring method. A procedure was set up 
for using special scoring strips and two Veeder counters in such a way 
that the scoring time for the PQ test was reduced from twenty to eight 
minutes. 
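
The comparison of simplified and original scoring described above can be sketched as follows. The cut points used to collapse the weights, the treatment of each item as simply endorsed or not endorsed, and all of the data are illustrative assumptions; the source does not give the rule by which Bennett assigned the values zero, one, and two.

```python
def collapse_weight(weight):
    """Map an original item weight (-7..+7) onto 0, 1, or 2.

    The cut points are an assumption for illustration only.
    """
    if weight <= -3:
        return 0
    if weight >= 3:
        return 2
    return 1


def total_score(weights, responses):
    """Sum the weights of the items endorsed (1); unendorsed items (0) add nothing."""
    return sum(w * r for w, r in zip(weights, responses))


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weights for six items and endorsements for six respondents.
original_weights = [-7, -4, -1, 2, 5, 7]
simplified_weights = [collapse_weight(w) for w in original_weights]
answers = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
]
original_scores = [total_score(original_weights, a) for a in answers]
simplified_scores = [total_score(simplified_weights, a) for a in answers]
agreement = pearson_r(original_scores, simplified_scores)
```

The value of `agreement` plays the part of the coefficients of .97 and .98 reported for the two Flanagan scales, although nothing in this toy example reproduces those data.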


Validity of Self-Estimates of Personality 


Since most personality inventories call for self-estimates on a series of 
items, the question of the correctness with which individuals ordinarily 
make judgments concerning their own personality characteristics is an 
important question. Tryon (137) reported on the basis of a verbal
portrait-matching test that there was a tendency for students to look more favorably
upon their own personality qualities than their peers did, but that this 
tendency varied widely among the traits. Crook (23) reported a similar 
conclusion as a result of administering the Willoughby Personality Schedule 
to two sections of a class in elementary educational psychology and then 
asking the students whether they felt that the changes in their personality 
had been in a favorable direction. It was felt that the data indicated that 
most people are overly optimistic in estimating the trend of their per- 
sonality development. On the other hand, Robertson and Stromberg (104), 
using the Royer Personality Inventory, found that college junior and 
senior women did not rate themselves significantly higher on the average 
than they were rated by friends. 

Spencer (122) administered a personality questionnaire to high-school 
juniors and seniors using care to preserve the anonymity of the responses. 
After the questionnaire was completed, the pupils were asked to indicate 
whether or not they would have answered all questions truthfully and 
willingly if they had been required to sign their names. Only 43 percent 
of the total population answered affirmatively. The author concluded that 
if the pupils had been required to sign the questionnaire, the purpose 
of the instrument would have been invalidated. Lentz (76) studied the 
effect of acquiescence, or the tendency to agree rather than disagree to 
propositions in general, on personality measurement and concluded that 
acquiescence may be a very distorting factor. It was indicated that no 
solution has been found to this problem except the employment of the 
double-presentation method, which is cumbersome for general use. 


Stability of Scores on Personality Inventories 


Pintner and Forlano (96) administered the Aspects of Personality 
Inventory to fifth-grade pupils four times at intervals of two weeks. Inter- 
correlations between the scores on the separate trials varied from .61 to 
.83. Robertson and Stromberg (105) gave the Royer Personality Inventory 
to forty-six college women in September 1935 and again in January 1938. 
The mean score changed in the direction of the more dominant, extro- 
verted, non-neurotic person. Hertzman and Gould (57) studied the 
functional significance of changed responses in a psycho-neurotic inventory. 
They used forty-two items selected from the original Woodworth personal 
data sheet and administered them twice to 147 women students with an 
interval of four weeks between administrations. The responses were changed 
most frequently to items in which the word “often” was used. 


Introversion-Extroversion 


The term “introversion-extroversion” is one of the most common concepts
in personality measurement and yet it is one concerning which there is
not by any means entire agreement. Abernethy (1) administered an
inventory consisting of forty-four questions selected from tests of
introversion-extroversion to 289 college students and 124 adults and attempted to
determine whether there really exists a marked negative correlation between 
“liking thought” and “liking people.” The correlations obtained were very 
low. The author concluded that the data did not substantiate the popular 
assumption that interest in people is incompatible with interest in thought, 
planning, and detailed observation. Collier and Emch (18) asked psy- 
chologists to classify items from seven representative tests of introversion- 
extroversion according to the degree of introversion or extroversion each 
item seemed to express. There was considerable variation of opinion as to 
whether the items described introversion or extroversion. Three tests con- 
sisting of introversion-extroversion items were administered to students 
and critical ratios were determined for the different items and compared 
with the ratings of the judges. The agreement was not close. 


Factor Analysis in the Study of Personality 


Reference has already been made to certain recent studies of personality 
tests in which factor-analysis technics were employed. Several other 
studies of this kind may be cited. In an attempt to bring out more clearly 
the primary traits or dimensions of rhathymia and of thinking introversion- 
extroversion, Guilford and Guilford (48) administered a set of eighty-nine 
personality questionnaire items to one thousand students and analyzed 
the intercorrelations between thirty of them by Thurstone’s method. Nine 
primary factors were found, six of which were identified as: D, depression; 
R, rhathymia; S, shyness or seclusiveness; T, habitual thinking of a 
meditative sort; Lt, liking for thinking of the problem-solving kind; and 
A, alertness. Guilford and Guilford (49) analyzed twenty-four items 
designed to bring out differences in hyperactivity. The analysis indicated 
that there were probably four dimensions of hyperactivity-hypoactivity. 
Two of them were identified as N, nervousness or jumpiness, and GD,
general drive, while the other two could not be identified. 

Brogden (14) attempted to determine what character traits were in- 
volved in the scores of one hundred sixth-grade boys on a group of forty 
tests purporting to measure various aspects of character, intelligence, and 
personality. Eight factors seemed to be involved, among which were the 
following five character factors: a persistence factor, a factor that seemed 
to be related to the w factor of the Spearman school, a self-control factor,
an honesty factor, and an “acceptance of the moral code” factor. Reyburn 
and Taylor (102) selected nineteen of the traits used by Webb in his 
character analysis for further factorial analysis by Thurstone’s method. 
Four personality qualities were isolated: a factor similar to Cattell’s 
surgency-desurgence, a factor resembling perseverance, a factor which the
authors called charity in the sense that the term is used in the Epistle 
to the Corinthians, and a factor called social sensitiveness. McNamara 
and Darley (82) reported a factor analysis of test-retest performance of
university students on the Minnesota Scale for Survey of Opinions, the 
Bell Adjustment Inventory, and the Minnesota Inventories of Social Atti- 
tudes. Adjustment to authority, socialized interests, and economic con- 
servatism were among the factors isolated. 

Gannon (40) sought to determine the dominant groups of personality 
traits among college men. The data were factored by the Spearman and 
Thurstone technics and yielded five groups of traits. Three of these belong 
to the introverted category and two to the extroverted category. In view 
of the popular tendency to regard extroversion as desirable and introver- 
sion as undesirable, it is interesting to note that while the first extroverted 
group implied adequate adjustment, the other extroverted group repre- 
sented a maladjusted trend generally characterized by troublesomeness. 


Analysis of Items 


Rundquist (110) concluded that the negative or “unacceptable” type 
of item is more valid than the positive or “acceptable” type. Layman (75) 
made a critical analysis of 782 test items taken from sixteen personality 
tests. The results suggested that “very few personality test items are such 
that they will present an adequately discriminative picture of an indi- 
vidual’s behavior tendencies or personality ‘traits’” (75:104). The most 
reliable items were of three types: (a) those which might be suggestive 
of abnormal tendency, (b) those which do not permit a variety of inter- 
pretations, and (c) those referring to behavior which does not change 
within short periods of time. 


Experiments with Unusual Approaches 


Trawick (135) set up and applied a procedure for selecting trait-con- 
sistent individuals. He reported that trait-consistent personalities tend to 
possess insight and to be self-confident, objectively modest, and goal 
seekers. McQuitty (83) used responses of college students and psychotic 
patients to certain questions in the Bernreuter Personality Inventory and 
the Strong Vocational Interest Blank in the development of indexes of 
concomitance of egocentrics—that is, relationship between self-concepts 
and objective concepts. He found well-interrelated egocentrics in the stu- 
dents and uninterrelated egocentrics in the psychotics. Zubin (157) 
stressed the need for the integral method to supplement the prevalent 
differential method of personality study and presented a technic for di- 
viding a group into subgroups of like-minded or like-structured individ- 
uals with reference to a given social criterion. In a study of the “fulcra 
of conflict,” Spencer (123) presented what is apparently a new approach 
to personality measurement. He indicated that personality conflict is a 
degree of discrepancy or incongruity between one’s self-report of his 
own characteristics and behaviors and his comparable report on certain 
ideals and behavior of others. The “fulcra of conflict” used in this study
are the subject’s ideals about behavior, his father’s ideals, his mother’s 
ideals, his father’s behavior, his mother’s behavior, and the behavior of 
his closest associates. 


Interest Inventories 


Several interest inventories have been published or described in the 
literature since January 1938. One of the most interesting of these is the 
Preference Record prepared by Kuder (72, 73). Information about this 
new blank was given in Chapter IV. Three interest-values inventories that 
to some extent reflect the influence of Spranger’s Types of Men and All- 
port and Vernon’s Study of Values test were reported by Maller and 
Glaser (84), by Wickert (148), and by Van Dusen, Wimberly, and Mosier 
(140). The inventory by Maller and Glaser is designed to measure four 
major types of interests or basic values: theoretic, esthetic, social, and eco- 
nomic. Test-retest reliabilities after a ten-day interval are given as .91 for 
theoretic, .93 for esthetic, .92 for social, and .87 for economic. These are 
high for test-retest reliability coefficients. Wickert’s test was planned to 
measure nine general desires or goal-values. The author reported that the 
reliabilities of the goal-values categories were too low for purposes of 
individual prediction but were high enough for the study of group re- 
lationships. The inventory by Van Dusen, Wimberly, and Mosier was 
based on Lurie’s factor analysis of Spranger’s Value Types. It consisted 
of a series of five scales designed to measure economic, theoretical, re- 
ligious, social, and esthetic attitudes. The reliability of the economic scale 
was given as .71. The reliabilities of the other scales ranged from .80 to .88. 

One of the objectives of the thirty schools participating in the Eight- 
Year Study of the Progressive Education Association is the development 
of interests. Various devices for measuring interests have been constructed 
in connection with that study. For instance, Sheviakov and Friedberg 
(120) reported the preparation of three interest inventories. One deals 
with the study of the different school subjects and the other two relate 
to extracurriculum activities and out-of-school situations. The interpre- 
tation of the interest inventories and procedures used in validating them 
are dealt with. The widespread interest in vocational guidance has led to 
the preparation of various vocational interest inventories during the last 
decade. The Strong blanks for men and for women are undoubtedly the 
best known of all these inventories. Strong (127) recently revised his in- 
terest blank for men, prepared a number of new scales, and simplified the 
scoring by reducing the range of weights assigned to the individual items. 


Evaluation of Interest Measures 


Skodak and Crissey (121) presented an analysis of the scores made by 
297 high-school senior girls on the Strong blank for women, the results 
of which raised a question concerning whether the blank was sufficiently
discriminative to be of value in vocational guidance. Seder (117) obtained 
the scores of women physicians and life insurance saleswomen on the 
Strong blanks for men and for women in connection with the occupations 
for which both blanks could be scored. The data indicated that both blanks 
were quite reliable. The results of factor analysis showed that the keys 
with the same names for the two blanks had similar factor loadings, ex- 
cept the lawyer’s keys. The analysis indicated that the interests of men 
and women engaged in the same occupation tend to be similar and sug- 
gested that separate occupational scales for the two sexes were not needed. 
Williamson (150) studied the validity of the Young-Estabrooks studious- 
ness scale on the Strong blank for the prediction of the marks of university 
freshmen. The correlation of .20 indicated negligible validity for this 
purpose. Kopas (70), using twenty-four of the occupations in the Strong 
blank, set up a simplified scoring procedure which requires a half hour 
or less rather than the several hours which are needed if a blank is 
scored by hand according to the standard procedure. The scores obtained 
in this way correlated from .49 to .71 with the standard scores for the 
different occupations. Although these correlations seem rather low, it was 
reported that the area of highest interest was the same in 82 percent of 
the cases. 
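
The agreement figure quoted above rests on a simple tally, which may be sketched as follows. The occupational keys, the scores, and the function names are hypothetical; the sketch shows only that two scoring procedures can differ in the scores they assign and still place the same occupation at the top of a person's profile.

```python
def top_area(profile):
    """Return the occupation with the highest score in one person's profile."""
    return max(profile, key=profile.get)


def top_area_agreement(standard_profiles, short_profiles):
    """Proportion of persons whose highest-scoring occupation is the same
    under both scoring procedures (profiles are paired by position)."""
    matches = sum(
        top_area(std) == top_area(short)
        for std, short in zip(standard_profiles, short_profiles)
    )
    return matches / len(standard_profiles)


# Hypothetical scores for three persons on three occupational keys.
standard = [
    {"engineer": 52, "lawyer": 31, "teacher": 40},
    {"engineer": 28, "lawyer": 47, "teacher": 44},
    {"engineer": 35, "lawyer": 33, "teacher": 49},
]
short = [
    {"engineer": 45, "lawyer": 38, "teacher": 41},
    {"engineer": 36, "lawyer": 39, "teacher": 42},
    {"engineer": 30, "lawyer": 37, "teacher": 43},
]
agreement = top_area_agreement(standard, short)
```

In this toy example the two procedures agree on the top occupation for two of the three persons.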

Darley (24) investigated the relationships of the results of the Strong 
interest blank to attitude and adjustment as measured by the Minnesota 
Scale for the Survey of Opinions, the Bell Adjustment Inventory, and the 
Minnesota Inventories of Social Attitudes. The data showed that the “atti- 
tude and adjustment tests not only derive meaning from their relations 
to the vocational interest test, but also add meaning to it by
complementing its definitions of occupational interest types.” Sarbin and Berdie
(115) studied the relationships between the interests measured by the 
Strong blank and the values measured by the Allport-Vernon scale. It 
was found that some occupational groups showing measured interest pat- 
terns were characterized by certain profiles on the Allport-Vernon scale. 
Groups may be differentiated in this way, although individual application 
of the results would not be advisable. The constancy of the scores of 
college students on the Allport-Vernon test was investigated by Whitely 
(146). The mean scores agreed closely from one year to the next. The 
coefficients of correlation ranged from .38 to .78. With the exception 
of the results for the religious scale, the mean scores were in close 
agreement with the norms. Thorndike (131) criticized the Pressey 
Interest-Attitude Test from the standpoint of proportion of immature to 
mature items and presented data to show that a person may obtain a 
low maturity score merely because he checks all items very extensively. In 
a reply to Thorndike’s article, Pressey (99) pointed out that the test 
differentiates according to age and correlates with other measures of 
emotional maturity. 


Technics of Measuring Interests


E. Greene and Dahlem (44) reported a study of grouping the items of 
a vocational interest schedule into occupational divisions instead of
arranging them alphabetically as is more common. The grouping of 18
vocational preference items under eleven headings brought about improved 
reliability (.89 as compared with .76), but caused no significant changes 
in distributions of ratings, and gave evidence of only slight halo effect. 
Rock and Wesman (106) investigated the relative efficiency of twenty 
different methods of weighting responses in the scoring keys for an in- 
terest test. They found that “reduced” and “unit” methods were consid- 
erably less efficient in separating groups than methods in which larger and 
more variable scoring weights were used. 

Flanagan (39) described a novel approach to the measurement of in- 
terests based on the Cooperative Contemporary Affairs Test which meas- 
ures the extent to which information concerning events in the preceding 
year has been acquired and retained. The profiles of relative scores were 
presented as measures of functioning interests. The validity of the measures 
was studied by comparing the results of the test with other data. This 
method is free from personal bias and wishful thinking. 


Attitudes and Opinions 
New Tests of Social Attitudes 


Hunter (62) prepared a test of social attitudes containing ninety-four 
statements divided into the following categories: Negro, war, economics 
and labor, social life and convention, government, religion, and miscel- 
laneous. The manual of directions gives the reliability of the whole test 
as .87 when predicted from the correlation of odd and even scores. The 
test may be used with college students and adults. Wrightstone (154) pub- 
lished a Scale of Civic Beliefs which is designed to measure racial atti- 
tudes, international attitudes, national political attitudes, and attitudes 
toward national achievements and ideals. The test is for use in Grades IX 
to XII inclusive. The reliability, based on the correlation of Form A with 
Form B for 252 pupils in Grades X to XII inclusive, is .897. A test of 
opinions and beliefs concerning certain social issues, known as A Survey 
of Opinion, was issued by the Committee on Evaluation Materials of the 
Institute for Propaganda Analysis (63). It is intended for experimental 
use in high schools, colleges, and adult discussion groups. There are two 
forms, each consisting of twenty-five questions so arranged that the state- 
ments in Form 2 are the paired opposites of those in Form 1. The test- 
retest reliability of the total test is reported as .90 in the manual of di- 
rections. 

Gristle (46) described the construction of a scale for measuring attitude 
toward militarism-pacifism. Allport (2) outlined the construction and 
use of a telic scale for measuring war-producing behaviors, which included
five types of international crises employed as hypothetical situations. Hart- 
mann (53) assembled 106 statements on controversial issues to form a 
measure of liberalism or conservatism and validated the items, using the 
method recommended by Kelley. Pace (91) pointed out the limitations 
in the customary use of opinions as indicators of attitudes and attempted 
to set up a different indicator based on what an individual says he would 
do in a variety of situations rather than what he says he believes. Data 
were presented to show that the test was sufficiently reliable and valid 
for ordinary classroom use. Pace (92) later reported a study
of the relationships between a Situations-Response Survey and a Survey 
of Opinions designed to measure fundamentally the same liberal-conserva- 
tive attitudes. The correlation between the two tests was .894. The results 
of the study indicated that the situations-response survey was a somewhat 
more discriminative instrument than the opinion scale. 

Ferguson (37) reported a study leading to the isolation of two primary 
or independent social attitudes which may be described by or predicted 
from scales for the measurement of attitudes toward (a) war, capital 
punishment, and the treatment of criminals, and (b) reality of God, evo- 
lution, and birth control. The study was based on the administration of 
certain Thurstone attitude scales to 185 Stanford University students. In 
a later article, Ferguson (36) presented the development of scales for the 
measurement of the two primary social attitudes, which he called “Re- 
ligionism” and “Humanitarianism.” The reliabilities of these two scales 
were reported as .82 and .88 respectively. Geiger, Remmers, and Greenly 
(41) set up a scale for measuring apprentices’ attitudes toward their 
training in such a way that six “intra” attitude scales were included in 
the generalized attitude scale. Hinckley and Hinckley (58), using a scal- 
ing technic similar to that of Thurstone, constructed scales to measure the 
following attitudes: (a) attitude toward the work relief program as a 
solution to the financial depression, (b) attitude toward personal re- 
sponsibility in earning a living, and (c) attitude toward receiving relief. 


New Tests of School Attitudes 


Tschechtelin and others (138) developed general survey and diagnostic 
attitude scales to measure pupils’ attitudes toward teachers. This type of 
scale was used with 1,357 children in Grades IV to VIII. The correlation 
between Form A and Form B was .79. A questionnaire somewhat similar 
in purpose was prepared by Tenenbaum (129). This test was planned to 
measure the attitudes of children toward teachers and classmates. Spear- 
man-Brown reliability coefficients of .853 and .907 were reported. Eells 
(29) constructed a scale for the evaluation of pupils’ attitudes toward va- 
rious aspects of secondary schools. Bolton (10), using the Thurstone- 
Chave method of equal-appearing intervals, developed two comparable 
scales for the measurement of attitudes toward mathematics. One scale
was prepared recently for measuring teachers’ attitudes toward problem 
situations at the high-school level. The study was reported by Anderson 
(3) as one of a series of investigations at the University of Illinois. The 
scale was based on a list of technics which teachers reported they had 
used in dealing with different school problems. 


Technics of Measuring Attitudes and Opinions 


Ferguson (38) listed seven requirements of an adequate attitude scale and 
concluded that the method of equal appearing intervals satisfies a larger 
number of these requirements than any other method. Whisler (145)
discussed the reliability of attitude scales as related to scoring method and
pointed out that there is a positive relationship between the number of 
items in an attitude scale which are accepted and the reliability of the 
scale. Lorge (80) published two articles on the reliability and consistency 
of responses to fifteen of the Thurstone attitude scales. He found that re- 
jected items (those not accepted by the person taking the test) should not 
be given as much weight as those accepted, and that the responses of in- 
dividuals aged forty or over were more reliable and more consistent 
throughout the fifteen attitude scales than were responses of persons aged 
twenty to twenty-five. Should statements be arranged in random order 
or order of descending scale values? Dunlap and Kroll (26) found that 
the means, dispersions, and reliabilities of the scales were not affected by 
the arrangement of the items. This finding led to the conclusion that the 
arrangement of the statements in descending order is preferable because 
of the greater ease of scoring. An additional finding was that if the subject 
was instructed merely to mark the three statements with which he was 
most in agreement, the scoring time was reduced without sacrifice in 
reliability. 
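
The scoring shortcut reported by Dunlap and Kroll can be sketched as follows. The sketch assumes the conventional scoring of an equal-appearing-intervals scale, in which a subject's score is the median scale value of the statements he endorses; the source does not say which average was used, and the scale values shown are hypothetical.

```python
from statistics import median

# Hypothetical scale values, printed in descending order as on the test form.
SCALE_VALUES = [10.2, 8.7, 7.1, 5.5, 3.9, 2.0]


def attitude_score(endorsed_positions, scale_values=SCALE_VALUES):
    """Median scale value of the endorsed statements (positions are 0-based)."""
    return median([scale_values[i] for i in endorsed_positions])


# A subject who marks the second, third, and fourth statements.
score = attitude_score([1, 2, 3])
```

Under this assumption, when exactly three statements are marked the score is simply the scale value of the middle one.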

Stagner (124) studied the cross-out technic as a method in public opin- 
ion analysis and found that it had validity for groups of stated political 
preference and showed considerable consistency. Fauquier (35) experi- 
mented with the measurement of attitudes of delinquent and normal boys 
by having each subject write his first four associations to each of the 
words hate, fear, love, and desire. Certain qualitative differences in the 
attitudes of the groups were discovered through an analysis of the asso- 
ciations. Ojemann (99) pointed out deficiencies in scales prepared by the 
Thurstone procedure and described a revised method of scale construction 
which attempted to obtain a deeper sampling of integrated performances. 
Tuttle (139) also criticized the Thurstone technic but offered no other 
technic as a substitute. 

A question in all attitude measurement is whether there are general 
attitudes or whether attitudes are specific to a given object or situation. 
Lentz (77) investigated generality versus specificity of conservatism with 
an instrument including 190 items sampling conservatism in six fields:
education, religion, government, sex, nonsocial, and general. The median 
of fifteen intercorrelations was .73. It was felt that this degree of corre- 
lation supports the concept of general conservatism. Wickert (147) car- 
ried on a rather extensive study of interrelationships of general and specific 
preferences, and concluded that “the concept of general attitudes may use- 
fully be employed in psychology along with that of specific attitudes.” 
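
The figure of fifteen intercorrelations in Lentz's study follows from the six fields sampled: six subscales yield 6 x 5 / 2 = 15 distinct pairs. A minimal sketch in Python of that count and of taking a median over the pairwise coefficients is given here. The function names are illustrative assumptions, and no coefficients from the original study are reproduced.

```python
from itertools import combinations
from statistics import median

# The six fields named in the text.
FIELDS = ["education", "religion", "government", "sex", "nonsocial", "general"]


def field_pairs(fields):
    """Return every unordered pair of subscale names."""
    return list(combinations(fields, 2))


def median_intercorrelation(corr_by_pair):
    """Median of the pairwise correlations; keys are pairs, values are r."""
    return median(corr_by_pair.values())


pairs = field_pairs(FIELDS)
n_pairs = len(pairs)  # six fields taken two at a time
print(n_pairs)
```

The median is used rather than the mean because it is less affected by one or two unusually high or low coefficients among the fifteen.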


Persistence Tests and Other Measures 


The measurement of persistence has long been one of the most interesting 
and at the same time baffling phases of personality testing. It is recognized 
that many instances of disparity between ability and achievement are ex- 
plained by variation in a complex of personality factors covered by the 
term persistence. Ryans published four recent articles on the subject. In 
the first of these (111:333-53) he subjected the results of nineteen tests 
and ratings of college sophomores to a multiple-factor analysis by the 
Thurstone method and isolated a general persistence factor and an in- 
telligence factor. In a second article, Ryans (111:355-71) described the 
preparation of a group persistence test having a reliability of .82, the 
components of the test being study time, extended arm endurance, and a 
form of persistence schedule. In a subsequent article (113), he reported 
uniformly low correlations between persistence scores and scores on the 
Bernreuter Personality Inventory. In a historical review of the measure- 
ment of persistence, Ryans (112) stated that “the extent to which an 
individual will endure fatigue, discomfort, or pain, the amount of time 
he will spend studying, and the amount of time he will spend working 
at specific tasks seem to be indicative of degree of persistence” (112:736). 
He pointed out that the existence of a general trait of persistence which 
permeates all behavior of the organism has not been established. Thornton 
(132), and Thornton and Guilford (133), reported a factor analysis of 
twenty-two tests purporting to measure persistence. The analysis did not 
reveal any factor universally present but showed the presence of five com- 
mon factors, described as (a) an ability or willingness to withstand dis- 
comfort in order to achieve a goal, (b) a factor of keeping on at a task 
(plodding), (c) physical strength, (d) mental fluency, and (e) feeling 
of adequacy. 


Construction of New Rating Scales 


Kelly (67) described a 36-trait personality rating scale of the graphic 
type, each question of which was rated on a 25-division scale. The re- 
liability coefficients for the different scales ranged from .31 to .86. Norms 
based on the rating of 299 men and 299 women were set up for the scale 
and the usefulness of the scale in counseling was indicated. Pechstein and 
Munn (93) designed a long and a short form of a rating scale of social 
maturity for use in the primary grades, a level at which good instruments 
for the evaluation of behavior are much needed. Fourteen patterns of social 
maturity were represented in the long form. Reliability coefficients were 
reported as .83 for Grade I and .98 for Grade III. Wolf (152) constructed 
a self-scoring form of the Vineland adjustment score card and compared 
results of using it with data obtained on the standard form filled out by 
the teacher. The self-scoring form differentiated somewhat more reliably 
between a group of high-achievement girls and a group of low-achieve- 
ment girls than the form on which teachers did the rating. 

Evjen (31) devised a behavior frequency scale applicable to school sit- 
uations. Thirty-three items are included in the rating, which is based on 
the frequency of observations of the type of activity listed. Cowell (21) 
described the development of a form for use in rating behavior trends 
of high-school pupils on the basis of statements that presented opposites 
of each behavior trend. Anderson (4) described technics for recording 
dominative and integrative contacts of teachers with kindergarten chil- 
dren. An observation blank was prepared and employed in observation 
of three different kindergarten groups. Exceptionally high reliability co- 
efficients (.95 to .97) were obtained between seventy-three pairs of con- 
secutive and simultaneous records of five minutes each made by two ob- 
servers. Dominative contacts exceeded integrative contacts for all teachers. 
A behavior rating scale for young chimpanzees was made by Crawford 
(22) consisting of twenty-two items whose average reliability, following 
the application of the Spearman-Brown formula, was .86. 
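
The Spearman-Brown formula mentioned above estimates the reliability of a lengthened or pooled measure from the reliability r of a single unit: r_k = k r / (1 + (k - 1) r), where k is the factor by which the measure is lengthened. The following is a minimal sketch in Python. The function name and the default factor of 2 are illustrative assumptions; the text does not state which factor Crawford applied.

```python
def spearman_brown(r, k=2.0):
    """Predicted reliability when a measure with reliability r is
    lengthened by a factor k (k=2 is the usual split-half correction)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r / (1.0 + (k - 1.0) * r)


# A half-test correlation of .50 steps up to about .67 for the full test.
print(round(spearman_brown(0.50), 2))
```

With k = 1 the formula returns r unchanged, which is a quick check that it has been entered correctly.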


Aspects of Rating Scale Technics 


Certain studies, a number of years ago, led to the conclusion that the 
optimal number of divisions in a rating scale was seven. Champney and 
Marshall (16, 17) obtained ratings on about thirty characteristics of 
parental behavior by means of a graphic rating scale divided into various 
numbers of units from three to ninety. It was found that the reliability 
of the ratings increased markedly with increase in number of units up 
to about twelve and that there was a less noticeable increase in reliability 
up to about thirty intervals, beyond which there was a small decrease. 
Wilke (149) studied the question of whether ratings for a group of per- 
sons, based on a seven-step scale, can be adequately summarized. He ob- 
tained a coefficient of contingency of .87 between the summaries of two 
independent readers. Lombardi (79) devised a rating method which is 
unique in that it involves not comparison of individuals but comparisons 
of traits within an individual. Any trait among the fifty on the scale can 
be selected as the calibrator and the other forty-nine traits judged as more 
or less conspicuous than this one. The reliability of the scale was reported 
as .82. The technic suggests interesting possibilities for investigating the 
organization of personality. 


Other Measures and Procedures 


Swineford (128) advanced a technic for the measurement of a per- 
sonality trait which is not dependent upon questionnaires, self-inventories, 
or ratings. In an objective subjectmatter test, each individual was per- 
mitted to determine for each item the number of points to be received 
for a correct answer (not more than four), with the understanding that 
if the answer was incorrect, twice the number of points selected for credit 
would be deducted. This procedure was used as a test of tendency to 
gamble. It was found that the gambling score was quite reliable and was 
independent of achievement on the same test. In a study of the measure- 
ment of social status, Zeleny (155) defined social status as the degree of 
acceptance of a person by his associates and developed mathematical 
formulas for a social-status ratio and a social-status score. The social-status 
score was later criticized by Dubin and Winch and defended by Zeleny 
(156). Rinsland (103) described an objective test for measuring teachers’ 
knowledge of the conduct and personality of children from six to eight 
years of age. 
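
The item-scoring rule in Swineford's procedure, as described above, can be sketched in Python as follows: the subject stakes up to four points on each item, gains the stake if correct, and loses twice the stake if wrong. The minimum stake of one point and all names are assumptions made for illustration. How the gambling score itself was derived from the stakes is not stated in the text, so it is not computed here.

```python
MAX_WAGER = 4  # "not more than four" points per item, per the text


def item_score(wager, correct):
    """Credit the wager for a right answer; deduct twice the wager otherwise."""
    # The lower bound of 1 is an assumption; the text gives only the maximum.
    if not 1 <= wager <= MAX_WAGER:
        raise ValueError("wager must be between 1 and 4 points")
    return wager if correct else -2 * wager


def total_score(responses):
    """Sum item scores over (wager, correct) pairs for one subject."""
    return sum(item_score(wager, correct) for wager, correct in responses)


# Three items: 4 points right, 2 points wrong, 1 point right.
print(total_score([(4, True), (2, False), (1, True)]))
```

Because an error costs twice the stake, a high stake pays on average only when the subject's chance of being right on that item exceeds two in three, which is what allows the size of the stakes to reflect a willingness to take risks.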
CHAPTER VI. 


Projective Methods in the Study of Personality¹ 
PERCIVAL M. SYMONDS and ELISABETH A. SAMUEL 


Since PROJECTIVE TECHNICS have not previously been systematically re- 
viewed in this magazine the present summary will not confine itself to 
the usual three-year period. Strang (72) dealt with a number of the 
technics in connection with mental hygiene, in the preceding issue. A 
number of general reviews have been published elsewhere. Horowitz and 
Murphy (33) referred to the growing tendency to supplement paper and 
pencil technics with the use of a variety of materials and methods to 
reveal conscious and unconscious motivation, attitudes, and needs. The 
authors discussed variations in materials used, from unstructured or in- 
choate material like clay, sand, water, through semi-structured inkblots, 
to the unequivocal forms of the family dolls or conventional toys. There 
is a parallel gradation in freedom or precision of method. They considered 
the possibilities of picture tests calling for interpretation, choice, or eval- 
uation which can be used as a mirror for the child’s conception of himself 
and his social attitudes. 

Frank (24) examined the dynamics of personality and the possibility 
of measuring it as a process. While standardized tests tell how nearly the 
individual approximates to a norm, projective technics should reveal the 
private world of meanings and feelings since they require the subject to 
organize the field, to interpret the material, and to react affectively to it. 
Frank classified responses into (a) constitutive: when the subject imposes 
a structure upon a plastic, unstructured substance such as clay, or upon 
partially structured fields like the Rorschach inkblots; (b) interpretive: 
when the subject tells what the stimulus situation, a picture, for example, 
means to him; (c) cathartic: when the subject discharges feeling upon the 
situation, as in play; (d) constructive: when the subject builds with given 
materials, like blocks, and in construction reveals some of his own 
organizing conceptions. Frank gave examples of projective methods, 
mentioning Stern’s cloud pictures, Rorschach inkblots, play, finger painting, 
expressive movements, drama, puppet shows, music, and the Thematic 
Apperception Test, showing how the subject projects the dynamics of his 
personality and so reveals “what he cannot or will not say.” 

Updegraff (76) reviewed projective technics in the study of preschool 
children and showed how they elicit the expression of the child’s funda- 
mental attitudes. Liss (45) referred to the work of Freud (25) and Klein 
(37) and pointed out that the purpose of the technics is to secure positive 
transference and to evoke material which will be used in the way in which 
the psychoanalyst uses dream material. The extensive exploratory work 
of Murray and his collaborators (56) shows a wide variety of projective 
technics in operation. Fifty men of college age were studied over a period 
of two and a half years by twenty-eight trained investigators. Some of the 
procedures used were conference, autobiography, drama, construction, 
Rorschach, and Thematic Apperception. The writers offer a theory of 
personality and a functional survey of projective technics. 

1 Bibliography for this chapter begins on page 90. 


Drawing 


Recent writers in this field refer to, and in some cases review, the ac- 
cumulation of psychological literature on the significance of drawing 
ability as an indication of intelligence, of developmental stages, and as 
the basis for comparative studies. But there is also in process a progres- 
sively richer psychiatric literature which is concerned with drawing as 
expressive movement, as an easy channel for the flow of the inner dy- 
namics of personality, and as a therapeutic agent. Appel (3) used the 
drawings of children as aids to personality studies. He asked the children 
to draw their homes, the persons who lived there, their friends, three 
wishes, and so forth. He found this informative with regard to the child’s 
social setting and a helpful approach to the inner unofficially expressed 
lives of the children. Appel used the drawings primarily as a starting 
point for conversation and did not seem to be concerned with drawing 
as a function significant in itself or with the latent content as distinct 
from the manifest content. Griffiths (26), in her study of the phantasy life 
of fifty children from five to five and a half years of age, acknowledged 
drawing as a process of self-revelation and as a therapeutic technic. 
Despert (18) in her work with children at the Psychiatric Institute, New 
York, was convinced of the diagnostic and therapeutic value of drawing. 
She also pointed out the value of the motor activity per se and said that, 
contrary to the usual belief that this method is best with inhibited chil- 
dren, she found it worked best with restless children, especially those in 
whom unconscious fears were underlying apparent aggressiveness and 
overactivity. This study is one of the most helpful in providing insight on 
the interpretation of children’s drawings as revealing deeper motivation. 

Liss (44) analyzed the psychodynamisms at work in the drawing process, 
noting the function of aggression and anxiety. The diagnostic criteria 
are size, form, color, and symbols, analysis of which gives, for example, 
a picture of inner attitudes, ego evaluation, and attitude to space. Bühler 
(11) studied the performance in the Ball and Field Test of 165 children, 
varying in intelligence and in adjustment. She found that unsuccessful 
solutions—confused and involved—were given by 78 percent neurotic, 
20 percent low intelligence, and 2 percent normal. She concluded that 
this test is symptomatic and diagnostic of emotional problems in children. 
Abel (1) set up an experimental situation to study the value of the draw- 
ing of free designs, with limiting conditions, as a personality index. The 
task was to draw a free design with nineteen straight lines and six curved 
lines, within a 4 by 6 inch rectangle. The results were that the schizophrenics 
showed meticulous adherence to instructions, the normal white and Indian 
groups showed an absence of originality and creativeness, and the Balinese 
had difficulty in organizing the material. 

McIntosh (48) reported on the use of children’s drawings as a means 
of psychoanalysis. Six children, three boys and three girls, from six to 
thirteen years old, IQ’s 68-126, all maladjusted, were analyzed chiefly 
through their drawings, their related associations, and the interpretation 
of the drawings and the associations. Drawing was felt to be a useful 
technic especially for those in later childhood and for those too old for 
the regular play technics. Spoerl (71), by sorting and matching technics, 
tried to establish the relationship between pictures and personality in a 
group of retarded children. He had drawings from eleven children, from 
seven to nine years old, and 164 judges. The first task was to put together 
pictures believed to be done by the same child. The second task was to 
identify the drawings with a personality description. Both the sorting and 
the matching tasks showed that in about 36 percent of the cases this was 
done correctly. The conclusions were (a) the drawings of a single child 
are highly consistent and easily identified, (b) personality can be judged 
from drawings. Reitman (63) used twelve line drawings showing facial 
expressions indicative of different emotions, shown to the patient who 
had to reproduce them. In the reproductions, patients depicted their own 
emotional states. Fleming (22) reported on the use of finger painting 
which had previously been developed by Shaw (67) and Shaw and Lyle 
(68), and which may be considered as midway between a play and an 
art technic. 


Play 


Wälder’s article (79) is a useful orienting introduction to play as a 
projective technic. He summarized theories of play found in academic 
psychology and proposed an explanation of the psychoanalytic theory of 
play. Academic psychology has sought to explain typical and traditional 
play in terms of atavism, mimicry, excess energy, preparation for the 
future, and in terms of functional pleasure. Psychoanalytic psychology 
is not able to supply a unitary explanation by which all play can be 
interpreted. It is a unique phenomenon which may have a number of 
determinants and various meanings. The psychoanalyst is primarily con- 
cerned with the person who plays and what it means to him. He may 
see in his play the drive for mastery, wish fulfilment, the attempt to as- 
similate by repetition overpowering experiences, transformations from an 
enforced passive role to a self-assumed active role, a leave of absence 
from reality and from the superego, and the weaving of phantasies about 
real objects. Despert (18) gave a survey of literature on play and a 
classification of technics. Suitable playroom equipment was described, and 
play was discussed as a means of affective abreaction and of getting information. The 
June 1938 issue of Understanding the Child dealt with the child at play. 
Half the articles were concerned with the developmental and pedagogic 
implications of play activity and half with play as a diagnostic and thera- 
peutic technic. Weiss-Frankl (80) showed how play is the equivalent of 
interview analysis in the adult. Simpson (69) reported several cases and 
indicated that the first play interview frequently reveals the child’s focal 
problem. Rank (62) quoted Freud’s observations of the play of an eighteen- 
month-old boy who dramatized in his play the going away and coming 
back of the mother. To the analyst, play is an important medium of ex- 
pression—the language of the child—and not necessarily therapeutic in 
itself. 

Lowenfeld (46, 47) regarded play as the expression of the impulses and 
ideas which have been repressed from consciousness owing to their in- 
compatibility with other parts of the psyche. This “primary system” can- 
not be represented in words but can in play. The “secondary system” which 
increases in volume as the primary system decreases is cognition. If a 
child fails to express the material of his primary system so as to make 
contact with it, there is a tendency for ideas of the primary system to 
dominate him and so adaptation to life is unsatisfactory and neuroses may 
result. Lowenfeld believed that play can relieve many slight neuroses. 
Despert (18) found that latent aggressive trends could be aroused by the 
repeated use of a sharp instrument. Early forgotten memories of a hostile 
nature were reactivated, and phantasies of aggression were brought to 
consciousness. Through free association the child was helped to gain in- 
sight into his deeper motivation. Conn (13) believed that play is the key 
to the locked door of what the child feels and needs. He used toys as a 
device for the child to express his feelings. He did not attempt to interpret 
to the child. Through a third-person conversation centered on the dolls, 
the child gave what is virtually a biography. The child’s feelings became 
desensitized by his being able to talk about them. 

Solomon (70) illustrated the use of active play, based on the work of 
Conn (13, 14, 15), and claimed that it was a short, effective method. The 
therapist was active, asking direct questions and offering ideas; the series 
of dolls and toys were operated in an assiduous fashion; the play situation 
was created for the child. The child revealed directly how he was func- 
tioning in his social environment and his feelings about the people there. 
There was little need for the interpretation of symbols as there was in 
the case of plastic materials, since the child expressed life reactions with- 
out resorting to symbolism or other repressive devices. The problem was 
projected on to a doll, and this gave the interview the objectivity of the 
third person. The therapist created the situation which the child faced. 
Cases were reported, and the limitations and dangers of this method were 
discussed. For example, there is the danger of getting too much information 
on the basis of a trick and not on the basis of rapport. This method has 
been found best with children from six to ten years. Of all methods it 
gives the quickest and the most complete picture of the child’s emotional 
life. There is no doubt as to its diagnostic value. Blanchard (9), in the 
discussion which followed, objected to the terms “active” and “passive” 
and preferred the term “controlled play” to describe Solomon’s technic. 
Despert (17) investigated personality differences in children of two to five 
years in a controlled nursery school setup. The situation and materials 
were constant, and different reactions were noted. In play with dolls, the 
child dramatized and expressed verbally his relations with his family. 
Despert emphasized the importance of supplementary information obtained 
by other means. Murphy (55) showed how in play the preschool child 
will indicate his assimilation of the pattern of his family experience. The 
literature on play as a projective technic has so far been concerned with 
material, the distinction between academic and psychoanalytic explana- 
tions of play, the value of the play interview, variations in the methodology 
(active and passive), the need for supplementary data, and warnings 
against the casual adoption of the method. 


Rorschach 


The clinical significance of inkblot interpretations was first explored in 
1911 by Rorschach (64), a Swiss psychiatrist, who had in mind a technic 
for the differential diagnosis of the insane. The results of his studies were 
published in a monograph, Psychodiagnostik, 1921. This has never been 
published in an English translation. After his death in the following year, 
a paper by Rorschach and Oberholzer (65) was published which dealt 
with Rorschach’s study of one of Oberholzer’s patients. These writings 
are the basic classics in the field. 

Vernon’s articles (77, 78) serve as a useful introduction to the sub- 
ject. He noted the increasing interest in the test and felt that it would 
be helpful in studies of the nature and organization of personality, char- 
acter types, and mental disorders. He pointed out the dangers of the un- 
skilled investigator and the attempt to use the test as an objective measure. 
He explained that the Rorschach technic is not a test but a psychodiagnos- 
tic instrument of the play technic type and depends on the investigator. 
Vernon suggested that the Rorschach technic is of value in vitalizing the 
findings of objective tests, observations, and case histories, but is not so 
suitable as drawing or play for younger children. Vernon indicated the 
need for work on the Rorschach to establish norms, reliability, and va- 
lidity. Vernon gave an almost complete bibliography which was later 
supplemented by Piotrowski (60). In America the Rorschach technic has 
been studied and developed in three main areas. Klopfer, in New York 
City, established a group called the Rorschach Institute which publishes 
occasional papers in mimeographed form in the Rorschach Research 
Exchange. Beck (6) in Chicago, and Hertz (28) in Cleveland, have 
developed scoring and interpretation procedures. Beck’s book (6) is useful 
since there is not an English translation of the original Rorschach pub- 
lication and since the method has been modified and developed. The author 
reviewed experiences with the Rorschach test and gave his own results. 
The book was written for the experienced student, technic was elucidated, 
and interpretation of individual responses given in detail. 

Validity—Benjamin and Ebaugh (8) criticized previous attempts to 
investigate the reliability and validity of the Rorschach test by means of 
statistical methods. They made a comparison of Rorschach and clinical 
diagnoses in fifty cases. The results showed that the Rorschach test has a 
high degree of diagnostic validity. Hunter (34) examined the value of 
the Rorschach test as a measure of intelligence and personality by com- 
paring, in the case of fifty pupils, intelligence as revealed in the Rorschach 
with the measure of intelligence given in the teacher’s estimation, by the 
Binet and Porteus Maze tests, and by the average of these two. Personality 
sketches prepared from the study of the Rorschach results were compared 
with personality sketches given by the teachers. It was found that the 
Rorschach indicates the general all-round level of functioning better than 
the Binet or Maze tests alone. 

The so-called “blind analysis” is another method of examining the va- 
lidity of the test. Hertz and Rubenstein (32) stated that the ultimate test 
of the method was the comparison of blind analyses, where the examiners 
knew only the sex and age of the subject. The writers compared two blind 
and one partially blind analysis prepared from one Rorschach record by 
three experienced examiners. The study claimed extremely high agreement. 
Comparison of the analyses with other clinical data showed that the 
Rorschach technics have a high degree of diagnostic validity. Piotrowski 
(61) reported a blind analysis of a case of compulsion neurosis. The 
Rorschach record and the analysis of it are given with the patient’s history 
and analysis of personality based on information received from the physi- 
cian. Troup (75) studied twenty pairs of identical twins, the hereditary 
similarity being established by the Rife diagnostic formula. Six judges, 
expert in the use of the Rorschach, collaborated. No high degree of re- 
semblance in temperament was found. It was concluded that the method 
had limitations in validity, reliability, and adequate norms. It was felt 
that the value of the method as a psychodiagnostic instrument depends 
on the skill of the examiner, but that need not interfere with its function 
which is not to supplant objective tests but to supplement. 

Reliability—Fosberg (23) explored the retest reliability of the Ror- 
schach results. The subjects were given the test on four occasions with dif- 
ferent instructions: (a) standardized, (b) to make the best impression, 
(c) to make the worst impression, (d) examiner asked the subject to look 
for various things in the inkblots. The psychodiagram remained recogniz- 
ably like that in the Rorschach administered in the standardized way. 
Klopfer and Davidson (16) examined and summarized the Rorschach data 
obtained from normal children by different investigators. 

There has been a steady stream of publications dealing with the refine- 
ment and elaboration of the Rorschach technic. Monnier (51), writing on 
the present technic of the Rorschach Psychodiagnostik Test, discussed the 
technic of administration and the quantitative and qualitative evaluation 
of the results. He believed that an intelligence test should be given first. 
He objected to the classification “introvert” and “extrovert” and preferred 
the descriptive terms “kinesthetic” and “chromesthetic.” 

Scoring—Hertz (30) studied three hundred high-school subjects in an 
attempt to increase the uniformity of procedure and objectivity of scor- 
ing. Criteria were determined for scoring various factors, and frequency 
tables were compiled to provide standards of normality for certain test 
categories. Such standardization simplified administration of the test, in- 
creased the efficiency of the examiner, and gave greater objectivity to the 
scoring. Hertz made a comparison of three lists of normal details for use 
in scoring: the Hertz list, statistically determined; the Klopfer-Rickers 
list, qualitatively determined; and Beck’s list, empirically determined. The 
highest percent of agreement on this point was found between Hertz and 
Beck, but there is a wide range of agreement for various cards. The agree- 
ment was thought to be encouraging but there is a real need for further 
statistical research. Hertz studied the accepted popular response lists used 
by five different investigators. There was agreement, although variable 
factors might influence their determination. A low percent of popular 
responses was found in groups which showed low intelligence, neurotic 
trends, and behavior problems. Klopfer (39) examined the shading re- 
sponse and described four types giving a tentative interpretation for each 
type. Klopfer raised the question of the advisability of standardizing the 
Rorschach method and concluded that schematization would be incom- 
patible with this method since it would tend to lessen the examiner’s in- 
terest in the individual nuances and facets of any record. 

Klopfer and Davidson (42) prepared a 4-page record blank which 
includes instructions, space for case history summary, and for the graph 
of personality determinants, formulas for the necessary interpretive re- 
lationships, description of refined scoring symbols, and so forth. There 
is also a separate sheet of photographic reproductions of the inkblots 
which can be used for indicating the location of responses. 

Suares (73) tried to establish norms for the Rorschach responses of 
adolescents and to investigate changes during this period. The test was 
given to ninety-eight boys and girls between twelve and eighteen years. 
Twenty-one of the boys and twenty-one of the girls had been given the 
Rorschach from two to five years previously. The retest showed that the 
girls tended to become more extratensive at adolescence and the boys 
tended to become more introversive. 
Miscellaneous—Pescor (57, 58) is writing a series of articles on the 
relationship of various personal factors to the Rorschach performance of 
476 delinquents. He found that the age factor, within the range of seven- 
teen to seventy-seven years, was of no statistical significance in the Ror- 
schach results. Certain significant tendencies were evident such as, in the 
case of older men, a greater frequency of original form and human re- 
sponses. The relationship between mental status and Rorschach perform- 
ance was found to be insignificant. Zulliger (81) used the Rorschach test 
for diagnosis and prognosis in the case of youthful thieves. He claimed that 
the test often indicates whether a youthful thief may be re-educated or 
not. Harrower-Erickson (27) suggested some military uses of the Ror- 
schach test to supplement intelligence tests in finding persons of emo- 
tional balance for positions of responsibility, to eliminate the emotionally 
unstable, to identify the simulator of mental symptoms, and to supple- 
ment the differential diagnosis of psychiatrists in cases of shell shock. 
The authors gave clinical illustrations. 

Munroe (54) found that the Rorschach technic had a function in the 
guidance of college students: (a) in the prediction of academic failures, 
(b) in suggesting whether or not a borderline student has resources 
from which improvement may be expected, (c) in planning programs and 
approaches according to the need of the individual, and (d) in giving 
a detailed and very accurate picture of the way in which the student’s 
mind functions. Munroe is extending her use of the test to cover the entire 
freshman class, the protocol being studied and interpreted as need arises 
in the analysis of any problem or development of a plan. There is an 
increasing use of the Rorschach test for diagnosis and analysis, in a wide 
variety of developmental and personality studies, with a recognizable 
emphasis on the investigation of neurotic and psychotic patients. The 
Rorschach test is being used in clinics and institutions as a diagnostic and 
analytic instrument, but as an instrument which is still in the process of 
being tested. 


Gesture and Expression 


Estes (21) reported on six experiments in which 323 judges estimated 
the personality of fifteen male subjects from brief motion picture records 
of their behavior. These estimates were validated against criteria obtained 
from an extensive study of their personalities. Three procedures were used 
in judgment, viz., rating, checklist, and matching. The results were all 
statistically significant but varied in accuracy with the judge, the subject, 
and the aspects being judged. Subjects who were introverted were least 
accurately judged. The conspicuously well-judged traits were inhibition- 
impulsion; apathy-intensity; placidity-emotionality; ascendance-submis- 
sion. Those judges who were interested in the graphic arts or dramatics 
were more successful than those whose dominant interests were in the 
sciences and philosophy. 


Handwriting 


The use of handwriting in the study of character and personality per- 
sists even though much of the earlier work has been discredited as being 
scientifically unsound and superstitious. Of the many discussions and 
studies which have appeared in the period under review the two following 
have been retained as worthy of serious consideration. Booth (10) used 
handwriting as an objective technic in personality testing. This is an 
approach to the person from the side of the action pattern. Alten (2) re- 
ferred to handwriting as the sum of crystallized gestures and an index of 
underlying expressive impulses. Form, size, manner of connection, slope,
and pressure are criteria of the writer’s taste, sense of space, tempera- 
ment, and clearness of thinking. Writing permits the conscious realization 
of unconscious processes. Alten quoted the work of Allport, Erlenmeyer,
and Klages, and included a useful bibliography.


Voice 


Moore (52) reported an investigation on voice and personality, using 
the Bernreuter Personality Inventory, self-ratings in speech, and ratings
by speech students. Breathy voices were found in persons who were high
in neurotic tendencies and introversion and low in dominance. The study 
indicated a possible relationship between types of voice quality, de- 
ficiencies, and personality traits, and the need for personality adjustment 
before speech therapy is possible. Caro (12) described a study compris- 
ing a half-hour broadcast of six male voices reading short prose selec- 
tions, and the listeners’ judgment of personality on the basis of these 
voices, with a report of their own personality. There was a positive re- 
lationship between self-description and listeners’ estimates of radio per- 
sonalities, the relationship being especially marked when the listener was 
a person with little education. 

Dusenbury and Knower (20) used phonograph records to express eleven 
emotional conditions while repeating letters A to K. These representations 
were judged by four groups of judges. The accuracy of judgments for 
recorded sounds was 83 percent and for facial expressions of the same 
emotional state 89 percent. Women’s judgments were 5 percent more ac- 
curate than those of men. Kelly (35) took advantage of the natural ex- 
perimental situation found in the fact that amateur radio operators rarely 
meet personally the other amateurs with whom they communicate and yet 
they form judgments of each other’s personality. A comparison was made 
between personality ratings based on voice and conversation alone, with 
ratings made by personal acquaintances. The median correlation between 
personal and amateur ratings in thirty-six traits was .22. 


Drama


Bender and Woltmann (7) worked out the use of puppets with disturbed 
children and showed how children produce puppet shows according to 
their emotional needs; how the group and the show gave a sanction for 
aggression and antisocial behavior and how the children distorted the 
plays, in the retelling, according to their own personal problems. 


Stories and Pictures 


Despert and Potter (19) made a systematic study to ascertain the value 
of the story as a means of investigating psychiatric problems. Three tasks 
were required: (a) popular stories to be reproduced: “Big Bad Wolf,”
“Goldilocks,” the story you like best of all; (b) stories made up by
subject, any story you wish to make up, a story about a boy (or girl), 
story about a father, mother, and children; (c) story made up by physician, 
told by teacher, retold to teacher in writing, retold by psychiatrist. The 
stories made up by the subject were found to be the most provocative of 
all. Productivity was not considered an index of the intensity of phantasy- 
life. Children with lower IQ’s were less productive on the whole. The boys 
were more productive and more aggressive, and it was suggested that there 
might be a positive correlation between aggression and productivity. Re- 
curring themes were found to indicate the main object of concern or con- 
flict. Anxiety, guilt, wish fulfilment, and aggressiveness were the main 
trends expressed. The phantasies thus expressed checked well with the 
material obtained by other means. The story approach was most valuable 
when complete freedom of subjectmatter was left to the child. 

Balken and Masserman (4, 5, 49, 50) used the Thematic Apperception 
Test devised by Morgan and Murray (53) with fifty patients with various 
forms of psychoneuroses and early psychoses. They found that the phan- 
tasies so produced were in accord with or supplemented the clinical eval- 
uation of the subject. The phantasies were believed to be of value in 
psychiatric diagnosis, prognosis, and in estimating the progress of psycho- 
therapy. The authors believe that the test should be further investigated 
and elaborated as an instrument of research and as an aid in clinical 
psychiatry. Symonds (74) explored the possibilities of using the Thematic 
Apperception Test in studying adolescent personality. An analysis of the 
stories and pictures used for the investigation of phantasy showed that 
those pictures are most serviceable which have a minimum of detail, are 
vague in theme, incomplete in content, and suggest characters with which 
those telling the stories can identify themselves. 


Conclusion 


The development of these miscellaneous devices for studying the dy- 
namics of personality shows that deeper motivations, enduring attitudes, 
and basic needs reveal themselves in separable and observable aspects of
conduct. Given an appropriate methodology, therefore, it would seem that 
the scope of the study of personality is co-extensive with human behavior.
Projective technics are an invitation to express in overt terms of move- 
ment, feeling, or phantasy the inner dynamics of personality. Sensitive as 
they are to what cannot be objectively measured and subject to the im-
pact of the examiner’s personality, they often lack reliability and estab-
lished validity.


CHAPTER VII


Applications of Personality and Character 
Measurement¹


JOHN W.M. ROTHNEY and BERT A. ROENS 


If the period under review has been as prolific in research as the last
period covered in the Review of Educational Research, the results have
not been published. A sort of routine research formula, however, seems 
to have become common practice—to construct or revise an instrument 
for the purpose of solving a practical problem, to administer it to one 
or more groups, and to study by common statistical procedures the scores 
obtained. Seldom does one come upon the development of an instrument 
or the utilization of a technic which shows promise of being more re-
warding than the rather ineffectual procedures that have been developed
in the past. It is probable that the kind of research which has been de- 
scribed for the most part in this chapter has reached its height and will 
be superseded by other forms of personality study. 


Social and Religious Attitudes of College Students 


College students continue to be the favorite subjects of experimenters 
despite many warnings that mere availability is not a satisfactory criterion 
for selection of groups from which generalizations are to be drawn. Gilli- 
land (40) gave the Thurstone Attitude Scales on attitudes toward God 
and the church to students of three universities and three denominational 
colleges and found little difference between the student groups. Nelson 
(62) attempted to determine the prevalence of radical attitudes in four 
state universities and fourteen church-affiliated colleges including 3,758
students. According to scores on the Lentz C-R Opinionaire, the students 
on the whole were rated conservative, with the women uniformly more 
so than the men. Harper’s Scale to Measure Social Attitudes, Chant and 
Myers’ Scale to Measure Optimism-Pessimism, Whisler and Remmers’ 
Scale to Measure Morale, and a questionnaire of opinions about social 
trends in the U. S. were administered by Whisler and Remmers (101) to 
150 men and 149 women undergraduate students in psychology in order 
to investigate group morale. These students found life satisfactory and 
believed themselves happier than their families. There was evidence of a 
slight relationship between liberalism and intelligence. 

Fay and Middleton (36) concluded that the use of the Thurstone Atti- 
tude Scales will reveal reliable differences among student attitudes toward 
Communism, patriotism, constitution, law, and censorship, which also 
were found to be related to father’s occupation and size of the home 


1 Bibliography for this chapter begins on page 104. 
town. The results of a long item questionnaire and a conservatism-rad-
icalism test given by Lentz (54) to 409 men and women indicated, when
the 100 who scored most radical were compared with the 100 who scored
most conservative, that the radical group was more favorable to science
and the arts, more imaginative, and more tolerant of the “underdog.” They
indicated less admiration for military and religious leaders, jazz enter-
tainers, and athletes. The conservative group was more opposed to change
and more favorable toward maintaining the status quo. Nelson and Nelson
(63) utilized scales which they constructed to measure radical-conserva-
tive, religious, institutional, social, and moral attitudes and found some
relationships among the scores obtained on those scales and the vocational
choices of college students. An attitude inventory administered to 191 stu-
dents by Mapheus Smith (84) indicated no significant relationship be-
tween attitude toward capital punishment and attitude toward war.

Attitudes toward sex and family—Bernard (1) reported a study of the 
attitude of 800 university students toward sex, marriage, and the family and 
discussed the social implications of his results. In Brandon’s study (4), 650
college students expressed their attitudes on selected phases of child de- 
velopment, and these results were compared with attitudes of highly trained 
persons in that subject. There were marked differences in a number of 
phases between these groups. Control and experimental groups were 
selected and the latter were subjected to a carefully planned learning pro- 
gram designed to modify their attitude. Significant gains were obtained in 
certain areas. A re-examination of part of the group after a period of two 
years indicated that some of the change of attitudes still remained. 

Honesty—Bond (3) submitted a paper and pencil “honesty” test contain- 
ing sixty-nine propositions and involving a possibility of 258 choices, to 
three hundred college students. About 50 percent of the students agreed on 
57 of the 69 propositions, and on 7 of the propositions there was very little 
agreement. Schnepp (79) reported the results of a questionnaire appraisal 
of 43 practical situations concerning various phases of honesty which was 
administered to three hundred college students. Behavior which was com- 
monly approved or disapproved dominated these results. The author pointed 
out that in actual practice their behavior would be below their “level of
principle.” 


Emotional and Social Adjustment of College Students 


McMorries (56) presented the results obtained by administering the Bern-
reuter Personality Inventory to 126 entering Negro freshmen at Lincoln
University. The scores indicated that one-third of the freshmen were mal- 
adjusted socially and emotionally. The Bell Adjustment Inventory ad- 
ministered to eighty of this group indicated that one-fourth had unsatis- 
factory home, health, and emotional adjustments. The Thurstone Psycho- 
neurotic Inventory was administered to 359 college students by McKinney 
(55) and these results were compared with personal histories obtained
from the subjects. The better-adjusted groups had more wholesome back-
grounds and better “bringing-up” than the poorly-adjusted groups. After
the Maslow Social Personality Inventory for dominance feeling was applied
to 500 college women by Carpenter and Eisenberg (12), the Carpenter 
Family Background Schedule was administered to fifty subjects at the domi-
nant extreme. The nondominant group indicated lower socio-economic
status, less independence, more association with adults, girls, and older 
children, and less with parents. There was no correlation between domi- 
nance, nondominance, and emotional stability. Hayes (46) compared scores 
on the Bernreuter Personality Inventory of seventy-six women college stu- 
dents with their family positions. The results indicated that the fewer older 
siblings a student had, the less likely she was to be neurotic and the more 
likely she was to be self-sufficient and dominant. Those students without 
older siblings seemed to be less sociable, more self-confident, and less 
introverted. These findings are consistent with other similar studies. 

The Minnesota Scale for the Survey of Opinions and the Bell Adjustment 
Inventory were given to 49 Jewish and 366 non-Jewish college freshmen by 
Sukov and Williamson (94). Results of the Opinions test indicated that 
of the two groups the Jewish students were inclined to be somewhat more 
maladjusted but the Bell Inventory indicated no significant differences. Six 
Bernreuter test scores were obtained for each of one hundred white and 
Negro college girls by Eagleson (30). In only one of the traits (self-suffi- 
ciency) was there a significant difference between the two groups. In this 
trait Negro girls made significantly higher scores. M. E. Smith’s study (83) 
indicated that college students of Hawaii show more neurotic symptom 
scores on the Thurstone Personality Schedule than college students at the 
University of Chicago. The most neurotic group according to the test were 
those of Korean and part Hawaiian ancestry, followed by the Chinese, 
Portuguese, Japanese, and other Caucasians in that order. Some significance 
is attached to the fact that the latter two groups have most prestige on the 
island. 


Interests, Personality, and College Achievement 


From the Allport-Vernon Study of Values Test and the Thurstone Per- 
sonality Schedule scores obtained from 240 college women students, Pintner 
and Forlano (69) concluded that there was no evidence of relationship 
between emotional stability and conflicting interests. Pintner and Forlano 
(68) also found no significant differences among the Thurstone scores for 
high and low groups on the values test, although the low group showed 
slightly neurotic tendencies. Duffy and Crissy (25) used the same Values 
test and the Strong Vocational Interest Blank for women (scored for ten 
occupations) with 108 freshmen entering a college for women. Significant 
relationships between values scores and Strong scores were found in a
number of cases, but the coefficients were not high. The values scores did
not have predictive value for academic performance nor did scores obtained 
on the American Council on Education Psychological Examination. Some 
factor analyses of the values scores were attempted and three factors were 
isolated and named “philistine,” people interest, and theoretical. 

An interest inventory was devised by Garrison (38) and administered 
to 320 students at North Carolina State College. From the results of this in- 
ventory interests of engineering, agriculture, and business students could be 
differentiated. By giving the Strong Vocational Interest Blank to 615 upper- 
class engineering students, five interest scales were constructed by Estes and 
Horn (35) which differentiated the interests of the engineering students in 
the civil, mechanical, electrical, chemical, and industrial curriculums. An 
attempt to discover the relationship between achievement and dominance 
test scores revealed to Meadow (58) that there is no significant relationship. 
The arithmetic test was one of long division and it may not have been 
realistic to 125 college women. In St. Clair’s study (77), the Bernreuter 
Personality Inventory and the Thurstone Psychological Examination were 
given to 688 college freshmen. No relationship between most personality 
traits and scholastic aptitude was found, though withdrawing tendencies
seemed to be important. Drought’s study (24) revealed no relationship 
between either the Bell Adjustment Inventory or the Wisconsin Scale of 
Personality Traits scores, on the one hand, and the difference between 
achieved grade-point averages in college and grade-point averages predicted 
from rank in a high-school class combined with a test of scholastic apti- 
tudes, on the other. 

Stump (93) administered the Almack Sense of Humor Test, American 
Council on Education Psychological Examination, the Willoughby Scale of 
Social Maturity, and the Allport-Vernon Study of Values Scale to ninety 
college students who also made self-estimates of their sense of humor. When 
sense of humor scores were correlated with the other tests, the highest cor- 
relation coefficients, .69 and .61, were obtained with the self-estimated sense 
of humor scores and the esthetic and social attitudes respectively. Height 
and weight were unrelated to sense of humor scores. The Terman-Miles 
Attitude-Interest Analysis Test was administered by Disher (20) to 556 
college women in Florida. The group did not differ in masculinity-femi- 
ninity reactions from women students in other parts of the country, but 
there was some evidence which would suggest that “as the groups became 
internally more homogeneous for various cultural factors, they tend to
draw apart with respect to the degree of femininity in attitudes and
interests.”


Factors Affecting Changes in Scores 


Royer’s Personality Inventory was administered twice by Robertson and 
Stromberg (75) to the same forty-two college women within a thirty-month 
period. The results indicated that most of the students were better adjusted
at the time of the second administration of the inventory after they had 
been in college for two and one-half years. Jones (50) measured the at-
titudes and changes of attitudes of college students over four years of col-
lege attendance with respect to the relationships between intelligence, age,
major subject, political party, religious affiliation, and liberalism. Many
correlation coefficients were reported. The Brown and Van Gelder study (6) 
of questionnaire replies concerning emotional reactions before examina- 
tion showed a peak of interference with performance just before and during 
the first few minutes of the examination and in the last few when some ques- 
tions are unfinished. In Weber’s study (100) forty-four college women of 
freshman grade were given the Guilford S.E.M. test, Allport A.S. Reaction 
Study and the American Council on Education Psychological Examination 
on six occasions separated by intervals of one week. The findings of this 
study can be interpreted as related to previous studies of Gatewood, Guil- 
ford, and Hunt demonstrating that schizophrenic patients are characterized 
by a high day to day variability of capacity. 

Mapheus Smith (85) reported two investigations of attitudes of students 
in an undergraduate course toward immigration and race problems. In one 
study the Bogardus technic for measuring social distance was given to 
forty-six students upon entering the course and again at the conclusion of 
the course. A second study was made with thirty-five students in which the 
Hinckley scale was used. The conclusion drawn from both studies is that 
attitude toward the Negro becomes more favorable after a semester study 
of race relations. Dexter (19) constructed and gave an attitude test to a 
group of participants in a religious conference before and after hearing 
each of four speakers. On the whole, there was little change of attitude 
noted. The use of statistical analysis alone in such research was questioned 
by Dexter. 

In a study directed by Ramseyer (70) 1,500 subjects ranging from 
seventh-graders to adults were given attitude tests before, and at several 
intervals after, viewing motion picture films dealing with the work of the 
Works Progress Administration and soil erosion. The results indicated
that showing these pictures produced decided and persistent changes (over a
two-month period) in the mean scores, with girls being more influenced 
by the pictures than boys. From the statistical results it would seem that 
percentile rank on the Ohio State Psychological Examination and stability 
of attitude were unrelated. Little relationship was found between informa- 
tion and attitude or between increase of information from the pictures and 
change of attitude. The subjects who were most out of sympathy with the 
subjectmatter of the films registered the greatest change of attitude after 
viewing the films. The Thurstone technic was used in the construction of 
the attitude tests. 

A follow-up study by Dyer (29) of 101 men students included in a study 
begun in 1924 by the author’s husband revealed that vocational interests 
which these subjects expressed in college were very similar to the vocations
they followed in later years. Sims’ study (82) of attitudes toward the TVA 
is an interesting example of the use of attitude scales for the measurement 
of changes in public opinion. Murphy and Likert (61) published a techni- 
cal discussion of methods of constructing and utilizing attitude scales and 
demonstrated their use in the study of radical and conservative tendencies. 
Darley’s investigation (17) of stability of scores on twelve scales of atti- 
tude, opinion, and adjustment showed distinct group and individual 
changes, with measured maladjustments showing more stability than 
normal social activities or generalized feelings and opinions. The author’s 
discussion of opinion stability is stimulating. 


Personality Studies of Other Adult Groups 


The Bell Adjustment Inventory and the Bernreuter Personality Inventory 
were used by Phillips and Greene (66) in a study of 173 women teachers. 
They found that married teachers made better adjustment scores, and that 
unmarried teachers obtained higher maladjustment scores as they grew 
older. Interests of the adjusted teachers seemed to lean toward social and 
out-of-door hobbies while the maladjusted teachers mentioned teaching or 
other work-type interests. Variability in types of response to personality 
questionnaires of many age, sex, and conjugal groups was studied by Wil- 
loughby (103). Dispensa (21) found no significant relationships among 
personality traits, metabolism, and intelligence for seventy-eight young 
women. Grove’s analysis (44) of factors in the personalities of mothers 
whose children had been brought to the Worcester Child Guidance Clinic 
suggested that ability to carry out plans, ability to make adjustments, 
satisfactory marital adjustments, affection for the child, absence of inferi- 
ority feelings, adequate social interests, lack of anxiety, and satisfaction 
with present conditions, were important factors in determining the treat- 
ability of mothers. 

A questionnaire intended to measure “general over-all job morale” pro- 
duced higher morale scores for selling employees than for the nonselling 
groups; also, morale scores decreased with increased length of service, 
according to Kolstad (52). An interesting study of personality patterns 
of village residents by the cluster block analysis method was reported by 
Schanck (78). The Bell Adjustment Inventory was given by Pallister and 
Pierce (64) to Scottish workers, to unemployed, and to college students 
in a Scotch industrial area. Scores were compared with the American norms 
obtained by Bell. The Scottish groups scored higher in home and health
adjustment and lower in social adjustment than the American groups. 
Bills’ findings (2) that scores on the insurance agent and real estate keys 
of the Strong Interest Blank were closely related to success after one year 
of selling insurance suggested that the Strong keys have some validity in 
this area. In a study by Hilgard (47) it was found that the Strong Voca-
tional Interest Test was a poor indicator of grades in probationary nursing
courses and of ratings on practical work in the wards, for nurses in a San 
Francisco hospital. For such predictions intelligence test scores were more 
useful than interest test scores, though low interest scores predicted those 
who would leave training in spite of their ability to do the intellectual
work involved. 


Personality Studies of Adolescents and Younger Children 


A personality test constructed from items of the Bernreuter and Cowan 
tests was given under the direction of Sheehy (81) to 777 boys and girls 
between the ages of nine and sixteen. Definite personality traits were found 
that developed with age. There was also marked agreement between pupil 
self-estimates and obtained case histories. Questionnaires concerning social 
adequacy and activity and the Willoughby Personality Inventory were 
used by Engle (34) in a study of 106 high-school boys and girls having 
a mean age of about fifteen. Although the reliability of the data was ques- 
tioned by the author, it was found that pupils who have a great deal of 
social and date activity are better adjusted than others. Four aspects of 
the development of self-reliance were reported by Stott (91). A group of 
high-school and college students were matched by Engle (33) on several 
criteria (grade, sex, school marks, IQ) but chronological ages of each 
matched pair at the time they entered high school or college were kept 
at least two years apart. The Cowan Adolescent Personality Schedule, a 
social activity questionnaire, and interviews given to this group did not 
reveal any significant differences in the two groups except that those pupils 
who believed that they were handicapped by acceleration were more mal- 
adjusted than the others. Two groups of children aged nine to fifteen were 
given the Brown Personality Inventory for Children by Springer (87). One 
of the groups (327 subjects of both sexes) which came from homes of low 
socio-economic status showed significantly more instability than the other 
group (473 boys and girls). No significant relationship could be obtained 
between high neurotic scores and sex, chronological age, score on the Good- 
enough Drawing of a Man Test, or parental ratings on the Barr Scale of 
Occupational Status. 

The Yepsen Adjustment Score Card was used by Durea (28) to rate 
1,838 children from elementary school through high school during each of 
six consecutive months. No significant differences of any kind were revealed 
in the adjustment of any comparable groups such as the sexes or races. 
In a study of one hundred children of borderline and above borderline in- 
telligence, Wile and Davis (102) found that both groups were equal in the 
number of personal problems and difficulties, but that the superior group 
could adjust itself more readily. A comprehensive personality study by 
Stott (92), involving about 1,855 adolescents from farms, towns, and 
cities, revealed that the city group received the best ranks on Maller’s In-
ventory and other personality scales. The town group ranked lowest. Horo-
witz and Horowitz (48) made an intensive study of the social organization
of a small rural community in the South and studied social attitudes by 
means of tests and interviews. The chief finding was that social develop- 
ment is not closely related to mental development. Other findings and the 
methods used in this study are worth consideration by workers in this 
field. 

A questionnaire study of economic interests of adolescents by Symonds 
(95) revealed that school children of high-school age are more interested 
in earning than in saving or spending money. Finch and Odoroff (37) used 
the Strong Vocational Interest Blank for Men to study the interests of 467 
boys and girls in junior and senior high school. The results confirmed the 
evidence of Carter and Strong that vocational interests of the two sexes 
show certain marked differences. The interests measured by the Strong 
blank were well developed prior to age fourteen. Gregg (42) listed the
interests of Negro boys and girls in mixed and separate schools and sug- 
gested that the development of these interests depends to some extent on 
the kind of school attended. Stagner’s study (90), by means of personality 
tests, attitude scales, essay autobiographies, and personal interviews of the 
relationship between emotional instability and attitude toward parents, 
seemed to show that emotionality determines attitudes. Several hypotheses 
concerning the source of emotionality and attitudes were suggested. 

In a study by Thorndike (96) forty-nine gifted boys and girls filled out 
the Pressey Interest-Attitude Test. The scores obtained corresponded more 
nearly to mental age than chronological age. They indicated maturity in 
absence of fears and worries but they showed less maturity of judgment in 
interests than normal children. A study by Van Alstyne and Hattwick (99) 
presented a comprehensive analysis and follow-up, with numerous behavior 
rating scales, questionnaires, social case histories, and so forth, of 165 
children who had attended the Winnetka Nursery School. A comparison 
(in post nursery school life) of one group who showed good adjustment 
and one with less effective adjustment indicated that the less-adjusted 
groups had shown markedly more indication of poor adjustment in nursery 
school than had the well-adjusted group. A great deal of evidence was pre- 
sented to indicate that the nursery school makes for better social and emo- 
tional adjustment. A half year interval analysis of the behavior and per- 
sonality traits of children whose ages ranged from two to four and one-half 
years was reported by Hattwick and Sanders (45). Generalizations were 
drawn concerning the ages at which control, experimentation, integration, 
fanciful thinking, and the increasing role of social influences appear. 
Roberts and Ball (73) described a series of rating scales developed by 
Thurstone’s method for the measurement of attitudes involving nine dif- 
ferent aspects of personality. Attitude scales administered to 1,357 children 
in Grades IV to VIII by Tschechtelin and others (98) revealed no relation-
ship with chronological age, mental age, or grade.


Problem and Delinquent Children


In a study by Gerlach (39), the Stanford-Binet and the Cornell-Coxe
Scales were given to sixty-one maladjusted boys between the ages of nine
and twelve who had no mental diseases, were nondelinquent, and in good 
health. Children of the aggressive type were superior in intelligence ratings 
and obtained better scores on performance tests. The Marston I-E Scale, a 
Behavior Problem Record, and a Behavior Rating Scale were used by 
Durea (26) to measure ninety-three boys and girls in the first four grades 
to discover the effectiveness of the ratings in locating maladjusted children. 
The Loofbourow-Keys Personal Index of problem behavior yielded low 
bi-serial coefficients with ratings of advisers in junior high schools. Riggs 
and Joyal (72) attributed these results in part to the lack of validity in 
the ratings. Burt (10) made an elaborate study of eleven traits of emotion- 
ality of 500 children referred for criminal or nervous peculiarities. Factor 
analyses by several methods seem to isolate the factor of general emotion- 
ality and the two factors of aggressiveness and unpleasant emotion. The 
relationship between temperament and physical traits was not high enough 
to allow estimation of one from the other. 

Durea (27) attempted to discover personality traits which would differ- 
entiate individuals with the lowest and highest degrees of delinquency. The 
Pressey Interest-Attitude Test revealed definite differences among the 
various groups as regards circumstances considered wrong, fear and 
anxiety states, things in which they were interested, and traits admired in others.
The Vineland Social Maturity Scale was administered to ninety-one de- 
linquent boys by Doll and Fitch (22) and indicated that nondelinquents 
were definitely higher in social competence than the delinquents but the 
factor of mental retardation might account for some of the difference. A 
technic for determining the degree of maladjustment of institutionalized 
male defectives was presented in a report by Brooks (5). Low positive 
coefficients were obtained when Horsch and Davis (49) computed the cor- 
relation between institutional demerits for misconduct and Bernreuter Per- 
sonality Inventory scores. Administration of the Thurstone Personality 
Schedule to 115 youthful first-offending prisoners by F. Brown (7) re- 
sulted in scores which were higher than college students’ and closer to 
schizophrenic, manic-depressive, or neurotic scores. 


Effects of a Handicap on Personality 


The Maller Personality Scale was administered by Seidenfeld (80) to 50 
tuberculous and 50 nontuberculous subjects who were paired according 
to a number of different categories including intelligence, age, and sex. 
Fifteen of the items showed significant differences between the two groups. 

A study by G. Brown (8) of 60 diabetic children and 60 of their nondia- 
betic siblings revealed no significant differences in general health, school 
achievement, intelligence test scores, and Woodworth-Cady scores. Par-
ents, however, reported increased excitability and irritability of the dia- 
betics after the onset of the disease. The Pintner Personality Outline, a 
questionnaire, and an intelligence test, were used by Chobat and others 
(13) in their study of 169 allergic children. The girls in the group 
seemed to be better adjusted than the boys, and the whole group showed a 
slight tendency toward introversion and submission. Intelligence tests 
indicated no deviation toward retardation or acceleration. 

A study of adjustments of blind and seeing adolescents, reported by 
P. A. Brown (9), indicated that when the Neymann-Kohlstedt Diagnostic
Test (Introversion-Extroversion) and the Clark Revision of the Thurstone 
Personality Schedule were administered to 359 seeing high-school seniors 
and 218 blind adolescents (from ages sixteen to twenty-two) no significant 
differences except in individual items were found. According to findings 
by Springer and Roslow (89) from the Brown Personality Inventory used 
with 59 paired deaf and hearing children, the deaf children obtained 
much higher neurotic scores than the hearing children. However, when 
items specifically conditioned by the loss of hearing were eliminated 
there was no appreciable difference between the two groups. Another 
study by Springer (88) found little difference between deaf and hearing 
children in teacher ratings on the Haggerty-Olson-Wickman Behavior 
Rating Schedules. Kirk’s study (51) of ratings on the Haggerty-Olson- 
Wickman Behavior Rating Schedule of 112 children in Grades I to VIII 
who were either deaf or hard of hearing suggested that normal hearing 
children have fewer “problem tendencies” than the deaf and hard of hearing
groups. Little difference was found between the groups in intellectual
and physical traits, and considerable difference was found in emotional
traits. When the deaf and hard of hearing groups were compared with each 
other no significant differences were found. The subjects in Gregory’s 
study (43) consisted of institutionalized mentally retarded children be- 
tween the ages of thirteen and eighteen. They were divided into two 
groups—those who were deaf and those who could hear. They were then 
paired off as to age and sex and given a personality test. The deaf children 
seemed to feel insecure and withdrew from social situations. 


Personality and Various Factors 


Pillsbury (67) showed that body form of 274 students as measured 
by the Pignet Index, and introversion-extroversion as measured by the
Guilford Questionnaire, are not related despite the use of every type of 
statistical analysis which might be applied. Cabot (11) used subjects of 
the Harvard Growth Study to examine the relationship between personality 
and physique (pyknosomes, leptosomes, and athletosomes). The results 
are typical in revealing no useful relationships among somatic types and 
personal characteristics. Darling’s investigation (18) of the relation between
personality traits and autonomic reactions to startling sensory stimuli
leads to a hypothesis of relationship between personality structure and auto- 
nomic functioning. Gottlober (41) studied the electro-encephalographic 
patterns of subjects classified as extroverts and introverts; extreme extro- 
verts were characterized by unusual rhythms. It is suggested that the per- 
sonality pattern and rhythm patterns are not causally related but are 
concomitants in an organismic whole. Claims that the lie detector test 
is valuable in dealing with personality problems, since it detects deception 
and thus clears the ground for readjustment, are made by Marston (57). 
It is also claimed that the more extensive use of the test will furnish a 
motive for moral education since there will be less dishonesty if its 
detection is more likely. 

Eisenberg (32) selected fifteen subjects of each sex who were at each 
extreme of dominance feeling, and ten judges examined their hand- 
writing. Judgments concerning personality were no better than chance 
expectation while judgments as to sex were a little better than seven out 
of ten. 

Drake and others (23) found slightly more relationship between self- 
rating by the Link Inventory and rating by classmates for boys than for 
girls, although none of the correlations was highly significant. The Roger 
Self-Rating Test scores of thirty-nine college girls and ratings obtained 
by their best friends did not show any consistent tendency for the self- 
ratings to result in overestimation of desirable characteristics, as reported 
by Robertson and Stromberg (74). Copeland’s study (15) compared 
reported measurements with actual measurements of height and weight, 
and showed that there were enough differences to warrant serious doubt 
concerning the validity of reports obtained from applicants for employ- 
ment. A tendency to make false reports of age was also noted. Spencer 
(86) showed that the use of self-rating questionnaires for the discovery of 
conflict is not likely to be effective when students are required to sign 
their names. When conflict is greatest deception may be high. 

Koos’ discussion (53) of observation, questionnaire, and rating technics, 
and their use in the study of personality adjustments and vocational guid- 
ance, sums up some of the major issues involved in the use of these 
methods of appraising personality. 


CHAPTER VIII


Statistical Methods Related to Test Construction 
and Evaluation¹


JOHN C. FLANAGAN 


THE DEVELOPMENT AND APPLICATION of statistical methods to problems
of constructing tests and interpreting the scores from these tests have 
progressed rapidly since the pioneering work of such men as Galton, Pear- 
son, Thorndike, and Spearman. However, it is clear to those working 
on these problems that far from having reached a plateau this development 
is still in a period of rapid growth. The recent progress is chiefly marked 
by the substitution of more efficient and rigorous procedures for the crude 
empirical methods which sprang up during the rapid development of this 
field. The prediction of even greater gains in the near future is based on 
the relevance for these procedures of certain rapidly developing branches 
of statistical theory which are associated with factor analysis and analysis 
of variance. The rapid growth in these fields is abundantly demonstrated 
by a tabulation of the number of references related to these topics included
in the last three numbers of the REVIEW OF EDUCATIONAL RESEARCH
which have been devoted to Psychological Tests. Toops and Kuder (166)
mentioned 14 studies related to factor theory and 4 studies on analysis of
variance and covariance in their 1935 review. In 1938 the chapter by 
Cureton and Dunlap (33) reported 28 studies on factor theory and 14 
studies employing analysis of variance. Included in the bibliography for 
the present chapter are 48 studies concerning factor theory and 19 refer- 
ences related to analysis of variance. 


Bibliographies, Textbooks, and General Discussions 


This review continues the corresponding summary on statistical methods 
by Cureton and Dunlap (33). Mention should also be made of the review 
of general educational statistics by Johnson (80) published in the Review 
for December 1939, and the summary by Dunlap (42) published in an- 
other journal. Lindquist (100), Wert (178), and Van Ormer and Williams 
(172) have published books on elementary statistics. Books which cover 
somewhat more advanced topics include Snedecor’s revised edition (146) 
which is principally devoted to the application of Fisher’s analysis of 
variance to biological problems, Lindquist’s more recent work (101) which 
is important as the first book to present a detailed exposition of the appli- 
cation of these methods to educational problems, and Peters and Van 
Voorhis’ revision (129) of their earlier book, which might be described
as an attempt to bring the materials of Kelley’s classic Statistical Method
up to date by including sections on newer topics. Two new books with less
of an applied point of view are those by Rider (135) and Mises (114).


¹ Bibliography for this chapter begins on page 123.

There has been a noteworthy trend in the teaching of statistics toward 
increasing the emphasis on understanding and appreciation as compared 
with the development of computational skill. Two examples of this changing
point of view are the student workbooks prepared by Lindquist (102)
and by Dunlap (44). Otis and Durost (126) prepared a pamphlet on the
application of statistical methods to test scores. A short discussion of elementary
statistical methods was written by G. M. Smith (144). Holzinger
(70) wrote a chapter for one of the yearbooks in which he gave in fourteen
pages an overview of the development of statistical methods from
early work on averages and the normal curve to present-day factor analysis.
The report of the most recent international Conference on Examinations
edited by Monroe (115) provided interesting discussions of technical
problems of examining. Two important sources of information concerning 
books on mental measurement or research and statistical methodology are 
the volumes edited by Buros (14, 15). About half of the former and 
about a fourth of the latter are devoted to reviews of books. 


Factor Theory: Summaries and Points of View 


Recent journals have contained a constantly increasing number of 
articles concerning factor analysis. New developments, disagreements 
among the leaders, mathematical complexity, misuse of the procedure by
new disciples, and the lack of much tangible evidence concerning the
utility of the technics have combined to make the topic a source of discussion
as well as humor, as evidenced by Cureton’s paper (32). However,
factor analysis has reached the stage where summary treatments
and general reviews have begun to appear. Thomson’s book (156) re- 
viewed and contrasted the various procedures. Holzinger and Harman 
(74) outlined briefly in this journal the principal types of factor analysis 
and now have a general book in press. 

Thurstone in a recent review (162) of current issues discussed a 
number of criticisms and set forth certain basic conceptions. The funda- 
mental postulate upon which his work rests is that “mentality . . . func- 
tions in terms of differentiable processes which do not all participate 
with equal prominence in everything that mind does” (162: 204). He
emphasized that factor analysis is not the last word but is rather explora- 
tory in character: 

The new methods have a humble role. They enable us to make only the crudest
first map of a new domain. But . . . [they should] enable us to proceed beyond the
factorial stage to the more direct forms of psychological experimentation in the
laboratory. I fear that this exploratory nature of factor analysis is often not
understood (162: 190).

With reference to the nature of factors, he stated:


The factor methods involve no assumptions whatever as to the nature of the factors. 
They may be physical or psychical, native or acquired, physiological, chemical, or 
social in character. They may have a significance only for the particular group
investigated (162: 195-96).

Elsewhere (164: 235) he stated:


The mental faculties isolated by the factorial methods are probably not ultimates. 
They will surely break down into further elements. 


Wolfle (186), in the latest general discussion of factor methods, stated: 

Factors are produced by anything that introduces correlation into a set of variables. 
... There are as many causes of factors as there are causes of correlation. . . . If 
the subjects show considerable heterogeneity in education, in experience, or in cultural 
background, factors attributable to these differences will appear (186: 25).... 
Factors are not artifacts; factor analysis does not create them. Each factor indicates 
the operation of some systematically working cause or set of causes (186: 26). 


Wolfle’s treatise is predominantly interpretative and nonmathematical. 
It includes a bibliography of 530 references, covering the years 1928-1940. 
An outstanding event of the recent period was the symposium on the fac- 
torial analysis of ability under the auspices of the British Psychological 
Society. The papers of the participants, Thomson (154, 155), Spearman
(147), Burt (17), and Stephenson (150), have since been published and
provide an excellent outline of current issues, particularly as related to
British factor theory. They reported that although the “tetrad difference” 
of the four participants had not been found to vanish completely, some of 
the “disturbers” of unanimity had proved not to be “significant.” 


Factor Analysis: Technical Developments 


An outstanding example of the power and utility of factorial methods 
is provided in two recent papers by Kelley. The first (83) contained 
a general discussion of the importance of the quantitative study of mental 
traits in a democratic society. In the second (87) Kelley proposed finding 
the activities which would yield the greatest happiness to the largest 
number and at the same time produce the greatest amount of that which 
society needs. The mathematical solution of the problem was shown to be 
an extension of Hotelling’s canonical correlations to the case of originally 
weighted variables. A timely illustration of the pertinence of the solution 
to the problems of classification occasioned by the Selective Service Act 
was presented. Thurstone (163) presented a new method for the rotation 
of axes to obtain what he has termed “simple structure.” Burt (16) 
presented a discussion of his method of analysis by sub-matrices some- 
times called the group-factor method. He also contributed a discussion 
(18) of “unit hierarchies,” the use of which he believes facilitates the solu- 
tion for principal components. Another computational procedure for ob- 
taining principal components was reported by Flood (58). He made no 
comparisons of the time required by his procedure with that required by other
available methods, but it appears that it might be considerably longer,
especially when the number of variables is large. 
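
The labor involved in a principal-components solution can be seen from a
small sketch. The routine below is not Flood's procedure, nor Burt's; it is
ordinary power iteration, chosen here only because it is short, and the
correlation values are hypothetical. It approximates the largest latent root
of a correlation matrix and the loadings of the tests on the first component.

```python
import math

def first_principal_component(corr, iterations=100):
    """Approximate the largest latent root and its unit vector by power iteration."""
    n = len(corr)
    vec = [1.0 / math.sqrt(n)] * n          # arbitrary starting vector of unit length
    for _ in range(iterations):
        prod = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(p * p for p in prod))
        vec = [p / norm for p in prod]      # rescale to unit length each round
    # Rayleigh quotient v'Rv gives the root belonging to the converged vector.
    root = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n))
    return root, vec

# Hypothetical intercorrelations of three tests (illustrative values only).
corr = [
    [1.0, 0.6, 0.4],
    [0.6, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
root, vec = first_principal_component(corr)
loadings = [math.sqrt(root) * v for v in vec]   # loadings on the first component
print(round(root, 3), [round(x, 3) for x in loadings])
```

Each round costs one multiplication of the matrix by a vector, so the work
grows with the square of the number of variables, which is consistent with
the concern expressed above about large problems.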

Many methods have been suggested for determining the smallest number 
of factors which will account for a matrix of correlation coefficients. Two 
recent contributions on this point are those of Hoel (69) and Young (189).
Wilson and Worcester discussed the conditions necessary for a solution
to be obtained in resolving five tests into two general factors (185)
and six tests into three general factors (184). Two methods of extending
the results of previous solutions to new data have been presented by Harman
(65) and Mosier (117). Thomson (153) developed a simplified
method of estimating a specific factor and Ledermann (97) reported
a method which is shorter than other procedures when the number of
tests is considerably greater than the number of factors. Ledermann (9%)
also provided the general solution for finding another matrix which
leaves the relation of the correlation matrix and the factorial matrix unchanged
and also preserves the matrix of factor loadings. Tucker (169)
gave a detailed description of a method for finding the inverse of a matrix. 
A method of making an initial transformation designed to assist in rotating 
to “simple structure” was discussed by Landahl (95). In a series of papers 
(96, 157, 159) Thomson and Ledermann discussed the influence of uni- 
variate and multivariate selection on the factorial analysis of ability. 

Criteria for selection of methods—There have been a number of discus- 
sions of the problem of stability or invariance of factor loadings. Young 
and Householder (190) suggested invariance was an important criterion 
for evaluating a system of factor analysis. Mosier (118), Harsh (66),
and Cox (28) reported that the method of rotating to “simple structure” 
provided such stability; Holzinger and Swineford (75) reported similar 
stability for the bi-factor method; and Humphreys (77) disputed Smart's 
conclusion that Thurstone’s centroid method of principal components did 
not provide stable values. On the other hand, Wilson and Worcester (183) 
questioned the meaningfulness of factors obtained from the method of 
principal components. In a reply Kelley (82) defended this procedure. In a 
later discussion Kelley (87) attacked the validity of the criterion of invari- 
ance, and suggested that a set of mental factors proposed for general use 
should aid in the understanding of distinctions which are important in the 
lives of people and in the processes of society, and that the criteria for judg- 
ing a set of factors should be the degree to which they accurately, fully, and 
economically facilitate this understanding of social living in its dual aspect 
of individual and social welfare. The difference in point of view revealed by 
these discussions is important and serves to emphasize the little appreciated 
fact that various individuals are using factorial methods with different immediate
purposes. To suggest that one procedure is best regardless of the
purposes of the investigator would appear to be naive. Criteria for the 
selection of factor methods were listed and discussed by Holzinger (72). 

Verification and application—Interesting work has been done by Holzinger
(71) and Holzinger and Harman (73) in comparing the results
obtained from the multiple factor methods with those found by means of the 
bi-factor technic. Although differing in detail these results are notable for 
the similarity of their general findings. Spearman (148) reworked Thur- 
stone’s Primary Mental Abilities data and found them full of “g”; mean- 
while Blakey (11) re-analyzed, by means of Thurstone’s methods, the 
famous test of the theory of two factors by Brown and Stephenson and ob- 
tained four general or group factors. In two studies (29, 108) it was re- 
ported that the method of principal components would not do what 
Thurstone’s multiple factor methods were designed to do. A discussion of 
“inverted factor theory” was presented in a joint article by Burt and 
Stephenson (19) in which points of agreement and disagreement in their 
views were noted. Another interesting approach to factor problems, the 
analysis of the causal systems underlying the variation in one dependent 
variable in terms of “basis variates,” was discussed in an article on con- 
fluence analysis by Mendershausen (113). Some relations between multiple 
correlation and factorial values were discussed by Guttman (64) and 
Dwyer (47). Tucker (170) discussed correlated factors and ways in which 
a general factor might operate. Numerous applications of factor methods 
have been reported in earlier chapters of this issue. 


Analysis of Variance; Tests of Significance 


The development of the theory of testing hypotheses has continued. 
Contributions to the problem of analysis of variance and co-variance 
in multivariate problems were made by Wilks (180) and D. Bishop 
(9). Finney (50), Pitman (130), and Morgan (116) presented tests for 
the significance of the difference between the two variances in a sample 
from a normal bivariate population. Tests of significance of the differences 
between regression coefficients derived from two sets of correlated vari- 
ables were described by Yates (188). A discussion of analysis of variance 
tests was presented by Tang (151), including tables of the probability of 
failing to reject a hypothesis when a second hypothesis is true. Pearson 
(127) and Neyman and Pearson (124) discussed combining independent 
tests of significance and general aspects of the testing of statistical 
hypotheses. 

Of more interest to educational workers are the discussions of the
application of these methods. The books of Snedecor (146), Lindquist 
(101), and Peters and Van Voorhis (129) were mentioned earlier. Two 
monographs deserve special mention—J. H. Smith’s survey (145) of tests 
of significance and Jackson’s discussion (78) of applications to education. 
Dunlap (39) and Crutchfield (31) also discussed applications to educa- 
tional and psychological work, and Shen (141, 142) contributed two 
important papers. Deemer (34) provided an example of Shen’s generalized 
formula for testing experimental treatments. H. M. Walker (176) discussed
the general problem of testing a statistical hypothesis. The “epsilon technic”
was the name given by Peters (128) to a test of significance involving
Kelley’s unbiased correlation ratio and resembling very closely analysis
of variance. The thorough discussion of the concept of degrees of freedom
by H. M. Walker (175) should prove of value to many who have had difficulty
in interpreting the treatments of the English writers.
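
For readers who have not worked through an analysis of variance, the
following is a minimal sketch of the simplest case, a one-way
classification. It is not drawn from any of the works cited; the scores and
the function name are hypothetical. It shows how the degrees of freedom
enter the ratio of the between-groups to the within-groups mean square.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way classification."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total
    # Sum of squares between groups: group size times squared deviation of group mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Sum of squares within groups: squared deviations from each group's own mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores under three experimental treatments.
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_ratio, df_b, df_w = one_way_f(scores)
print(f_ratio, df_b, df_w)   # 12.0 2 6
```

The resulting ratio would then be referred to a table of F with 2 and 6
degrees of freedom; the table lookup is omitted from the sketch.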


Correlation Formulas and Computational Methods 


Although many procedures have been presented for the calculation of 
partial and multiple correlation coefficients and regression coefficients, 
the methods described by Wren (187) and by Chauncey (25) appear to
be economical. Chauncey reported obtaining the multiple correlation 
coefficient for a six-variable problem in thirty minutes with the aid of a 
calculating machine. Wherry (179) reported two methods of obtaining 
approximate multiple regression coefficients. Kuder (94) developed a 
method of calculating intercorrelations from the International Test Scoring 
Machine—a method which appears lengthy but is of theoretical significance. 
Flanagan (56) reported a note on using the test scoring machine to 
calculate the standard error of measurement and reliability coefficients. 
Dunlap (40) described the use of tabulating machines for estimating 
tetrachorics, and Hayes (67) presented a table for obtaining tetrachorics 
and their probable errors from the percent differences in groups. Hayes 
(68) also criticized an earlier study of the interrelations of the votes 
of legislators and proposed the use of tetrachoric correlation and factor 
analysis. 
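
The computation that these short methods are meant to speed up can be
sketched as follows. This is the ordinary textbook solution, not the
procedure of Wren, Chauncey, or Wherry: the standard regression weights are
obtained by solving the normal equations formed from the intercorrelations
of the predictors, and the squared multiple correlation is the sum of each
weight times the corresponding validity coefficient. All values and names
are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def multiple_r(r_xx, r_xy):
    """Return the multiple correlation and the standard regression weights."""
    betas = solve(r_xx, r_xy)
    r_squared = sum(b * r for b, r in zip(betas, r_xy))
    return math.sqrt(r_squared), betas

# Hypothetical case: two predictors correlating .5 with each other
# and .6 each with the criterion.
r_xx = [[1.0, 0.5], [0.5, 1.0]]
r_xy = [0.6, 0.6]
R, betas = multiple_r(r_xx, r_xy)
print(round(R, 4), [round(b, 4) for b in betas])   # 0.6928 [0.4, 0.4]
```

With six variables the same routine applies without change; only the size
of the matrix increases.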

Flanagan (54) reported a short method of estimating the product-moment
correlation coefficient from data at the tails of the distribution.
In a previous study (53: 55-60) he had provided tables based on cases 
beyond one standard deviation above and below the mean, i.e., using ap- 
proximately 16 percent at each end. The recent table, however, utilizes 
the upper and lower 27 percent of the distribution since that has been 
reported by Kelley (86) as optimal for upper and lower groups in the 
study of test items. Mosier and McQuitty (122) presented similar charts 
but based on the upper and lower 25 percent and the upper and lower 
50 percent. It has been shown by Flanagan (55) in an unpublished paper 
that the use of upper and lower 50 percent groups, which is equivalent to 
the estimation of tetrachoric correlation coefficients, produces coefficients 
which are significantly less accurate than those obtained from upper 
and lower 27 percent groups, in addition to requiring almost twice as much 
tabulation time. It thus appears that many recent workers who have 
turned to tetrachorics as a short cut could have saved considerable time 
and obtained more accurate results had they used a smaller portion of 
their data. 
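
The tabulation step on which the tail methods rest can be sketched as
below. The sketch only forms the upper and lower groups and counts the
proportion passing an item in each; converting the two proportions to a
correlation coefficient requires the published table, which is not
reproduced here. The data, the function name, and the parameter `fraction`
are hypothetical.

```python
def tail_proportions(total_scores, item_scores, fraction=0.27):
    """Proportion passing an item in the upper and lower tails of total score."""
    n = len(total_scores)
    k = max(1, round(n * fraction))            # number of cases in each tail
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_lower = sum(item_scores[i] for i in lower) / k
    p_upper = sum(item_scores[i] for i in upper) / k
    return p_upper, p_lower

# Hypothetical total scores and right (1) or wrong (0) marks on one item.
totals = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35, 38, 40]
item = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1]
p_upper, p_lower = tail_proportions(totals, item)
print(p_upper, round(p_lower, 3))   # 1.0 0.333
```

With 27 percent in each tail only 54 percent of the papers are tabulated,
as against all of them when the groups are the upper and lower halves.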

A new measure of rank correlation was developed by Kendall (88). It
was reported that the sampling distribution of the new measure is normal, 
which is not true for Spearman’s rho in the case of small samples. Since 
the new measure is also more easily computed than the latter, Kendall, 
Kendall, and Smith (89) suggested that the new measure might well 
replace that value in obtaining correlation coefficients from ranked data. 
Another suggestion for the computation of the correlation between ranks 
when they are expressed as percentile ranks was made by Bennett (6). In 
plotting the scatter-diagram he suggested that class-intervals be chosen 
in such a way as to normalize the data. Thorndike (160) warned that high 
correlations between the average of a group and another variable do not 
necessarily indicate high correlation between individuals and the other 
variable. A simplification of Thomson’s formula for the corrected correla- 
tion of initial scores and gains was given by Zieve (191). 
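
The two rank measures may be compared on a small example. The sketch below
assumes untied ranks and uses the customary definitions: Kendall's measure
is the number of concordant pairs minus the number of discordant pairs,
divided by the total number of pairs, and Spearman's rho is obtained from
the squared rank differences. The rankings are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's rank correlation for untied data."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                s += 1        # pair ordered alike in both rankings
            elif product < 0:
                s -= 1        # pair ordered oppositely
    return s / (n * (n - 1) / 2)

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two sets of untied ranks."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical rankings of five pupils by two judges.
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 4, 3, 5]
tau = kendall_tau(rank_a, rank_b)
rho = spearman_rho(rank_a, rank_b)
print(tau, rho)   # 0.6 0.8
```

The two coefficients are on different scales and are not to be expected to
agree numerically, as the example shows.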

Other technics—Illustrations of the use of Fisher’s discriminant function
were furnished by Travers (168) and by Lorge (105). Wilson
(182) warned of the limitations of the formula for the sampling error of 
the median. The value of stratified sampling has long been realized by 
many workers but persons employing this method have been handicapped 
by the lack of an adequate theoretical treatment of the topic. This has 
recently been supplied by Neyman (123). 

A number of notes on shortening computational procedures have ap- 
peared. Among these are Zubin’s methods (12, 192) of determining the 
significance of differences between frequencies; a variant of these methods 
by Casanova (23); a short cut for the Chi-square test for “Goodness of Fit”
by Du Bois (37); and a summation method by the same author (38) which
appears to be a real time-saver in computing means and standard devia- 
tions. A useful abac for obtaining the mean deviation of the area under 
a segment of a normal curve was provided by Dunlap and DiMichael (45). 


Basic Statistical Tables and Calculation Devices 


Kelley’s new statistical tables (84) replaced and extended the Kelley- 
Wood tables of the normal curve values, which have been out of print for 
some time. Conrad and Krause (26, 91) prepared normal curve tables 
based on probable error units instead of standard deviation units. Enlow 
(48) reported on his statistical slide rule giving average times for solving 
various statistical formulas with this device. Otis (125) prepared improved 
forms of normal percentile charts. 


Studies of Types of Tests and Test Items 


Weidemann and Morris (177) reviewed the essay-type test and con- 
cluded that though much research is necessary to discover how to over- 
come its current faults there is a definite need and place for improved 
essay tests. Ashburn (3) added one further study to the accumulated
evidence that essay-type questions are not usually rated in a manner which
will give consistent results even for the same rater. Jones (81) emphasized
that the fact that two examiners agreed closely in marking essay
examinations could not be considered evidence that their marks were
based on anything of importance or were accurate appraisals of the examination.
A low correlation, however, constitutes incontrovertible evidence
that the different raters are not appraising the examination accurately.

Andrew and Bird (1) confirmed the general finding that if the factors 
of time per item and item-difficulty are not controlled, the completion
or recall type items are superior to the multiple-choice or recognition
type items as written by the average college teacher. In another study (2) 
these investigators showed that recall items are slightly more stable than 
other types though all types were found consistent in their differentiation 
of students from one administration to the next. Carter and Crone (22) 
found that new-type tests can be shortened and at the same time improved 
through a simple technic of item study and revision. They also reported 
that differences in the relative reliabilities of parts are not ordinarily 
sufficiently large to make possible an increase in the reliability coefficient 
of the total by removing the least reliable part. Cronbach (30) suggested
the need for further study of the multiple true-false item. Gray (61)
reported some success, notably in the field of language usage, in ad- 
ministering tests by means of phonograph recordings. 

In the field of personality measurement Rundquist (137) reported that 
form of statement influences response and that items, agreement with 
which indicated an unfavorable position with respect to the trait under 
consideration, were most valid. Lorge (104) found that endorsed or 
accepted statements (in an attitude scale) have a higher degree of internal 
consistency than rejected statements. Preference type of items were found 
by Kuder (93) to give consistent results even though used in quite dif- 
ferent forms (paired comparison and rank-order). An interesting device 
intended to reduce “halo” effects in trait ratings by asking the rater to 
compare the traits within an individual, one with another, rather than 
comparing him with other persons, was presented by Lombardi (103). 
Further discussion of test construction in the field of personality measure- 
ment appears in Chapter V of the present issue. 


Item Analysis 


A few years ago test-makers were deluged with methods for selecting 
the best test items. It was fashionable to work out some new function 
of the difficulties of the items in an effort to get a method superior to the 
formulas of various colleagues. Fortunately, the testing movement seems 
to have outgrown this stage and it is now clear that precise mathematical 
solutions exist for the various problems of selecting or weighting items. 
Some time ago Toops and Royer (167) pointed out that problems of item-selection
and item-weighting were essentially multiple regression problems.
The use of the ordinary regression methods, however, is prohibitive in 
working with a large number of items. A few years ago Flanagan (53: Chap. 
4, 57) developed a short successive approximation method for the solution 
of multiple regression equations and similar prediction problems which 
enabled the test constructor to obtain in a reasonable length of time either 
the “best” possible combination of test items for the purpose at hand or as 
close an approximation as he wished. Richardson and Adkins (134) have 
proposed, as a short method of selecting test items, use of the regression 
coefficient indicating the weight which should be assigned the item when 
taken in combination with the total test to predict the criterion. Although 
such a procedure will not insure the selection of the “best” possible com- 
bination of items, it is the first step to be taken in approximating such a 
solution. The successive repetition of a procedure such as this was what 
was proposed by Flanagan as a method for obtaining the exact solution. 
In practice, however, it has been found that this one step beyond the first 
approximations (given by the correlation of the item with the criterion) 
adds but a negligible amount to the validity of the combination of items 
selected. 
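
The first approximation mentioned above, the correlation of each item with the criterion, can be illustrated with a short sketch. It is offered only as an illustration and not as the procedure of any author cited: the function names, the toy data, and the use of the ordinary product-moment correlation on items scored 0 or 1 are all assumptions introduced here.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def rank_items_by_validity(item_scores, criterion):
    """Return item indices ordered from highest to lowest item-criterion
    correlation, together with the correlations themselves.

    item_scores: one list per item, each holding 0/1 scores for all examinees.
    criterion:   one criterion score per examinee.
    """
    validities = [pearson(item, criterion) for item in item_scores]
    order = sorted(range(len(validities)),
                   key=lambda i: validities[i], reverse=True)
    return order, validities


# Toy data (hypothetical): three items answered by four examinees.
items = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
criterion = [4, 3, 2, 1]
order, validities = rank_items_by_validity(items, criterion)
```

Selecting the items at the head of such a ranking corresponds to stopping at the first approximation; it takes no account of the intercorrelations among the items, which is what the further regression steps described above are meant to supply.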

A discussion of general considerations in the selection of test items has 
been given by Flanagan (54). In this connection, the methods of correla- 
tion, discussed earlier, which devolve upon the cases in the extremes of 
distribution of test scores, are applicable. D. Walker (174) reported an 
empirical study of the effect of the shape of the distribution of item- 
difficulties on the shape of the distribution of scores. It would seem that 
considerable light might be shed on this problem by writing out the 
formulas for the moments of the distribution of scores as functions of the 
sum of the items of which it is composed. A new “index of discrimination” 
for “evaluating test items” was added to the list by Barry (5). Diamond 
(35) suggested the use of the typewriter in tabulating data as a substitute 
for punched-cards when they are not available. In an interesting paper 
Lev (99) applied the method of analysis of variance to the problem of the 
evaluation of test items. Baker (4) published a detailed illustration of 
Lev’s method. Although it may be desirable to explore this procedure fur- 
ther, it seems clear that the problems of item-selection are essentially 
problems of correlation and prediction and that the appropriate statistics 
are therefore measures of degree of relation and not tests of significance. 


Units, Scales, and Scaling Methods 


The problem of units of measurement has long been a troublesome one 
in the field of psychological and educational measurement. Recently cer- 
tain critics, notably B. O. Smith (143) and May (112), have argued that 
measurement in the fundamental sense in which the term is used in the 
physical sciences is not possible with present instruments for appraising 
psychological and educational behavior. Smith gave the impression that 
the whole effort at measurement in the field of human behavior had been 
rather futile and could not amount to much until the criteria for scientific 
measurement were fulfilled. There was a trace of this attitude in the dis- 
cussion by May. In a later discussion, however, May (111) suggested 
that it might be more appropriate, in the present stage of testing in this 
field, to judge the instruments in terms of engineering standards rather 
than standards of pure science. From this point of view, instead of having 
some “ten tests of measurement” involving various mathematical and 
logical points, only one “test” was needed, namely: Does the test serve the 
function that it was intended to serve? Flanagan (52) discussed the nature 
of units, and also the advantages and disadvantages of various types of 
scores now in use, in his report of the development of a system of scaled 
scores for the Cooperative Tests. He also adopted a criterion of utility as the 
major consideration. Scates (140) some time ago pointed out that even in 
the physical sciences measurement often does not depend upon equal 
units, a known zero point, and so forth. 

A novel and interesting approach to the problem of a metric for mental 
functions has been reported by Gengerelli (60). This work is too new 
to pass judgment on but the novelty of the approach should be stimulating 
regardless of the final evaluation made of the specific studies and pro- 
cedures reported. Lundberg (107) discussed the general problem of 
scaling and the measurement of attitudes; Lorge contributed a paper 
(104), discussed earlier in this issue. Dunlap and Kroll (46) presented 
evidence that placing statements in order of scale value and instructing 
the individual to check three items only, simplified the scoring without 
greatly affecting the scores obtained. A discussion of the wide variety 
of uses made of the Thomson method of scale calibration in developing, 
scaling, and studying the Vineland Social Maturity Scale was given by 
Bradway (13). The abac by Dunlap and DiMichael (45) is useful in 
scaling. 

On the basis of an empirical study, Champney and Marshall (24) re- 
ported that reading graphic rating scales to millimeters (a 100-point 
scale) gave higher reliability coefficients than were obtained when the 
scale was read to the nearest centimeter (a 10-point scale). Guilford (62) 
compared the median values of the judgments assigned on rating scales 
with the values obtained from these judgments by three scaling methods. 
The relation between the medians and the estimates obtained by the other 
methods was found to be nonlinear. The author was in doubt as to the 
implications of this finding. Guilford and Jorgensen (63) discussed the 
problem of normalizing ratings. Urban (171) discussed the method of 
equal-appearing intervals as a scaling procedure. Two simplifications of 
Thurstone’s method of successive intervals have appeared recently, one by 
R. Bishop (10) and the other by Mosier (119). 


Problems of Weighting 


The problem of selecting the “best” weights to assign to each of several 
different measures when combining them so as to predict some criterion 
variable has been solved, at least for the case of linear combinations. The 
question as to the most appropriate weights for use when a combined 
score is desired and no criterion is available has, however, only recently 
attracted attention. A thorough discussion of this problem was provided 
by Wilks (181), who proposed three methods of assigning weights. The 
first of these was previously advocated by Horst (76) who suggested the 
use of the first factor loadings obtained from Hotelling’s method of 
analysis into principal components. The second method was the equaliza- 
tion of the correlation of each variable with the total or combined score. 
The third method involved the equalization of the increments of variance 
for the combined score as each variable is combined with the remaining 
variables. The mathematical procedures for obtaining these weights were 
described. Still another method has been suggested by Thomson (158), 
namely, the use of Hotelling’s canonical correlations to obtain the maxi- 
mum correlation of the combined scores with a combination of scores ob- 
tained from independent measurements of the same traits. Kelley (87) 
discussed certain important problems of weighting, including a comment 
on Thomson’s proposal on weighting for maximum reliability. Kelley’s 
point seems of general applicability, namely, that the weights assigned 
should depend upon the purpose in combining the variables. 
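
The first of the methods described above, weighting by the first principal component, may be sketched as follows. This is only an illustration under stated assumptions: the variables are taken to be in standard-score form, the weights are found by simple power iteration on the correlation matrix (a standard way to obtain the dominant component, not necessarily the computing routine used by the authors cited), and the function name and the example matrix are hypothetical.

```python
def first_component_weights(corr, iterations=200):
    """Approximate the first principal component of a correlation matrix.

    corr: square list-of-lists correlation matrix.
    Returns weights scaled to unit length; the first-factor loadings are
    proportional to these weights.
    """
    n = len(corr)
    w = [1.0] * n  # starting vector; any vector not orthogonal to the answer works
    for _ in range(iterations):
        # Multiply the matrix by the current vector, then rescale to unit length.
        w_new = [sum(corr[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in w_new) ** 0.5
        w = [v / norm for v in w_new]
    return w


# Hypothetical correlations: two closely related measures and one more distinct.
corr = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
weights = first_component_weights(corr)
```

In this example the two highly correlated measures receive equal and larger weights and the third a smaller one, which illustrates Kelley's point that a set of weights is "best" only with respect to a stated purpose: here the purpose is to account for as much of the total variance as possible, not to predict a criterion.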

Two empirical studies of weighting have appeared recently. Stalnaker 
(149) concluded from studies of the correlations between total scores 
obtained by rather dissimilar weightings of the parts of various College 
Entrance Examination Board tests that any influence of the usual weight- 
ing factors is so small as to be insignificant. Scates and Fauntleroy (139) 
studied the effect of weights on index numbers; the conclusions, however, 
seem relevant to other weighting situations. Unlike the study by Stalnaker, 
this one found that weights did make a difference. Some of the factors 
determining the effect of weights were listed. 


Reliability Coefficients; Accuracy of Measurement 


For some years the reliability coefficient has been the most popular 
method of reporting the accuracy of measurement of a test. Ambiguities 
arising from this procedure have frequently been noted and many com- 
parisons have been made of the results obtained from various methods. 
Read (131) recently reported that changes in the method of splitting 
the items within the parts of a test into halves had little effect on the 
reliability coefficients obtained except that because of the effect of time 
limits it is not advisable to compare the first half with the last half. An 
empirical comparison of reliability coefficients obtained from test-retest, 
split-half, and the equivalent forms methods by Remmers and Whisler 
(132) showed that different values may be obtained from these methods. 
A similar finding was reported by Ferguson (49), who made a factor 
analysis of the half-test scores from three equivalent forms. The effect of 
such inconsistencies on the correction of correlation coefficients for attenua- 
tion, and on partial correlation, was discussed by Thouless (161), 
and appropriate correction procedures were provided. One will find this 
discussion profitable. 

Rulon (136) reported a simplified procedure for determining the reli- 
ability of a test by split-halves which was similar to the method of Otis 
and Knollin. Flanagan (51, 52) reported the use of this type of calculation 
to obtain standard errors of measurement at various score levels for a 
given test. He also discussed the sources of ambiguity in obtaining esti- 
mates of accuracy of measurement. In another publication (56) he sug- 
gested a short-cut for obtaining the data for such values or for reliability 
coefficients when the tests were scored on the International Test Scoring 
Machine. Mosier (121) gave a brief discussion of the concept of the 
variability of an individual score. 
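
A split-halves calculation of the kind referred to above may be sketched as follows. The sketch uses the difference-score formula usually attributed to Rulon, in which the reliability of the total is one minus the ratio of the variance of the differences between half-scores to the variance of the total scores, and in which the standard deviation of those differences serves as the standard error of measurement of the total. The function names and the toy scores are assumptions introduced here, and the sketch makes no attempt to give standard errors at separate score levels.

```python
from statistics import pvariance


def rulon_reliability(half_a, half_b):
    """Split-half reliability of the total score from two half-test scores.

    half_a, half_b: scores of the same examinees on the two halves.
    """
    diffs = [a - b for a, b in zip(half_a, half_b)]
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 1 - pvariance(diffs) / pvariance(totals)


def standard_error_of_measurement(half_a, half_b):
    """Standard error of measurement of the total score, taken as the
    standard deviation of the differences between the half-scores."""
    diffs = [a - b for a, b in zip(half_a, half_b)]
    return pvariance(diffs) ** 0.5


# Toy data (hypothetical): five examinees, scores on odd and even items.
odd_half = [5, 7, 6, 9, 3]
even_half = [4, 8, 6, 7, 5]
reliability = rulon_reliability(odd_half, even_half)
sem = standard_error_of_measurement(odd_half, even_half)
```

For these toy scores the reliability comes to about .80 and the standard error of measurement to about 1.41 score points. No Spearman-Brown step-up is required, since the formula refers directly to the total score.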

An interesting new development in the calculation of reliability co- 
efficients of the split-halves type was given by Richardson and Kuder (133). 
Of the four formulas developed, the one which they recommend is reported 
to require about the same time as does the split-halves method and has the 
advantage of providing a unique solution. The authors pointed out that 
the method has all the other disadvantages of the split-halves procedures 
and assumes the rank of the matrix of item intercorrelations to be one; 
i.e., that all items are measuring a single general factor and specific factors, 
and that no other general factors or group factors are present. Jackson (79) 
proposed a new measure of accuracy of measurement which he called the 
“sensitivity” of a test. This value is reported to be the ratio of the stand- 
ard deviations of the true scores and the errors. It is difficult to see any 
advantage which this value has over the standard error of measurement. 
In a later discussion (78) of the relationship between the sampling unit 
and the estimates of reliability of a test the author employed the probable 
error of measurement. 
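
The two quantities discussed in this paragraph can be illustrated briefly. The first function below computes the Kuder-Richardson coefficient now commonly called formula 20, on the assumption (not confirmed by the text above) that this is the formula the authors recommend. The second expresses the ratio of the standard deviation of true scores to that of errors in terms of the reliability coefficient r, namely the square root of r/(1 - r), which follows from the usual definition of reliability as the ratio of true variance to obtained variance. The function names and the toy responses are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 1 (right) or 0 (wrong).

    responses: one list per examinee, holding that examinee's item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    total_var = pvariance([sum(person) for person in responses])
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people  # item difficulty
        sum_pq += p * (1 - p)                                  # item variance
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_var)


def sensitivity(reliability):
    """Ratio of the true-score standard deviation to the error standard
    deviation, written as a function of the reliability coefficient."""
    return (reliability / (1 - reliability)) ** 0.5


# Toy data (hypothetical): five examinees, four items of graded difficulty.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r = kr20(responses)
gamma = sensitivity(r)
```

Because the ratio is a function of the reliability coefficient alone, it carries no information that the coefficient does not already carry, whereas the standard error of measurement is expressed in score units. This may be part of the reason for the doubt expressed above as to its advantage.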

The probable error of measurement is ordinarily estimated for an in- 
dividual by indirect means. Kreezer and Bradway (92) reported a study 
in which they retested each person several times and calculated the prob- 
able error of measurement directly from the distribution of scores obtained 
for that individual. Certain factors such as gains due to actual growth were 
controlled to a considerable extent by using mature feeble-minded in- 
dividuals. The factor of spurious correlation between specific factors due to 
the use of the same form on each retest was not discussed. Lorge and Mor- 
rison (106) reported that for a group of about one hundred individuals 
who were given two forms of five attitude scales two weeks apart, the 
principal component scores beyond the first factor were not reliably 
determined. Other variations in retest scores were reported in Chapter 
VII. The reliability of the essay test was reported on earlier in the present 
chapter. 


Validity and the Interpretation of Scores 


It seems strange that the problem of validity, which is the prime essential 
in all testing, has not received more adequate treatment. Only a few scat- 
tered papers have discussed this topic even briefly in the past three years. 
A valuable starting point has been provided by Burton (20) in the form of 
an outline definition of validity. Ryans (138) discussed the use of the 
methods of factor analysis in validating tests. Carr (21) emphasized the 
dependence of validity on reliability, and suggested it might be more 
satisfactory to report the type of relation to which the coefficient applied 
rather than to try to classify it as a measure of reliability or validity. 

Judgments and ratings play an important part in the procedures of 
validation but analyses of such judgments have been all too few. Preston 
(131) pointed out that in increasing the size of the group whose average 
judgment is used, the reliability of the judgment may be increased in the 
sense that it would agree more closely with another similarly obtained 
average judgment. However, this does not at all guarantee that the average 
judgment is becoming more valid, since it may merely be more accurately 
measuring something negatively correlated with the trait in question. Large 
numbers will not serve as a substitute for accurate analysis and penetrating 
insight into the behavior in question. A review of the recent literature con- 
cerning the neurotic questionnaire with special reference to its validity 
was made by Mosier (120). There has been a real need for easily under- 
stood procedures for describing test validity. Taylor and Russell (152) 
provided tables giving the proportion of persons who will be satisfactory 
when three conditions are known: (a) the proportion of satisfactory indi- 
viduals among the candidates, (b) the proportion of candidates to be 
selected, and (c) the correlation of the selective battery of tests with 
the criterion. 
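
The kind of value given in such tables can be approximated by a short calculation. The sketch below assumes that test and criterion follow a normal bivariate distribution with the stated correlation, and integrates numerically over the selected candidates; this assumption, the function name, and the step count are introduced here for illustration and are not taken from the published tables.

```python
from statistics import NormalDist


def proportion_satisfactory(base_rate, selection_ratio, validity, steps=4000):
    """Proportion of selected candidates expected to be satisfactory.

    base_rate:       proportion of all candidates who would be satisfactory.
    selection_ratio: proportion of candidates selected (highest test scores).
    validity:        correlation of test with criterion, strictly between -1 and 1.
    """
    z = NormalDist()
    x_cut = z.inv_cdf(1 - selection_ratio)   # test cutting score (standard units)
    y_cut = z.inv_cdf(1 - base_rate)         # criterion cutting score
    resid_sd = (1 - validity ** 2) ** 0.5    # spread of criterion about regression
    upper = 8.0                              # density is negligible beyond this point
    h = (upper - x_cut) / steps
    joint = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * h            # midpoint rule
        p_ok = 1 - z.cdf((y_cut - validity * x) / resid_sd)
        joint += z.pdf(x) * p_ok * h
    return joint / selection_ratio


value = proportion_satisfactory(0.50, 0.50, 0.50)
chance = proportion_satisfactory(0.50, 0.50, 0.0)
```

With half of the candidates satisfactory, half selected, and a validity of .50, the sketch gives a value close to .67; with zero validity it returns the base rate of .50, as it should.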

Norms—Courtis (27) emphasized the need for a measure of the effort 
put forth on a test and the desirability of measuring results in terms of 
growth. Main and Horn (110) presented evidence that the use of grade 
norms which were not based on really average or unselected children of a 
given age, with specified amounts of school experience, caused the average 
child to appear “maladjusted.” Kelley (85) provided a discussion of the 
difficulties in interpretation of the ordinary age and grade norms and 
proposed the use of norms describing the modal age group in a given grade. 
He termed these “ridge-route” norms since they follow the ridge of the 
frequency distribution. The system of Scaled Scores developed by Flana- 
gan (52) for the Cooperative Tests similarly provides norms for specified 
age groups in a given grade who have had a particular amount of instruc- 
tion. Other problems in the derivation of norms have been discussed by 
the same author (51). Kent (90) has provided a discussion of the value of 
local norms for an institution. 

The problem of “practice effect” has not received much attention in the 
recent literature. McIntyre (109) reported that in standardizing intelli- 
gence tests in Australia a fairly large gain attributable to previous experi- 
ence with the test was made even after a year. It seems likely that with the 
increased use of tests in this country, students will become so “test- 
wise” as to render the advantage gained from having taken a similar form 
of the test practically negligible. 


Problems of Administering and Scoring Tests 


A review of this field prior to 1938 was made by Frutchey (59). In the 
field of test administration, Dickenson (36) reported a plan for preventing 
copying by having students code their answers according to a simple 
pattern word which would differ for adjacent students. A problem has 
arisen because some scoring devices are such as to prevent the student from 
changing his answer if he feels his first choice was incorrect. Berrien (8) 
found that students in psychology classes improved their scores by making 
changes. 

Three studies have been reported recently concerning the effect of the 
method of indicating responses to test items. Tireman and Woods (165) 
found that when the same test was retaken by students by underlining their 
answers, they were able to improve their scores over those made by indi- 
cating their answers in the margin six weeks earlier. It is difficult to de- 
termine how much of this result may be attributed to the method of respond- 
ing and how much to practice effect and learning. Votaw and Danforth 
(173) administered a 50-item achievement test in three ways: (a) placing 
the number of the selected answer in a marginal space, (b) underlining the 
selected answer, and (c) placing a check mark in the appropriate square 
or space on a separate answer sheet. Students poor in a test involving fol- 
lowing directions for putting numbers in squares did less well on (c) than 
they did with the other methods. It is not clear how much of this might be 
attributable to their not understanding the directions. The students took 
18 percent longer for methods (a) and (c) than for method (b). In this 
comparison also it is difficult to estimate how much time was lost in getting 
the directions. The most extensive experiment in this field was made by 
Dunlap (41). Using three thousand students in Grades IV to VIII he com- 
pared articulated and nonarticulated answer sheets, repetitive and serial 
numbering, underlined score when the answer is also marked elsewhere 
and just marking elsewhere, and marking on the margin versus marking 
on a separate answer sheet. In general the differences between the various 
procedures were not large. 
We may conclude from this and other accumulated evidence that prac- 
tically all students can learn to mark their answers on margins or separate 
answer sheets if given an opportunity to practice, but that any new 
method of administration may cause some variations in scores. The person 
interpreting test-scores must usually keep in mind the disturbing possibility 
of serious scoring errors if the tests are being scored by untrained clerks or 
by teachers. Dunlap (43) reported many scoring errors when teachers 
scored two-choice items in which the answers were indicated in the margin. 
They also had difficulty when two words were to be underlined in the text. 
A short method of scoring the Bernreuter Personality Inventory was de- 
veloped by Bennett (7). This was described in Chapter V. 

In summary it is gratifying to note that the number of problems for 
which rational rather than empirical solutions are being obtained appears 
to be definitely increasing. The most encouraging trend is that technics 
are gradually being developed which are appropriate to the complicated 
situations with which the study of human behavior is concerned. 